Preface



Since 1993, cryptographic algorithm research has centered around the Fast Software
Encryption (FSE) workshop. First held at Cambridge University with 30
attendees, it has grown over the years and has achieved worldwide recognition
as a premier conference. It has been held in Belgium, Israel, France, Italy, and,
most recently, New York.

FSE 2000 was the 7th international workshop, held in the United States for 
the first time. Two hundred attendees gathered at the Hilton New York on Sixth 
Avenue, to hear 21 papers presented over the course of three days: 10-12 April 
2000. These proceedings constitute a collection of the papers presented during 
those days. 

FSE concerns itself with research on classical encryption algorithms and related
primitives, such as hash functions. This branch of cryptography has never
been more in the public eye. Since 1997, NIST has been shepherding the Advanced
Encryption Standard (AES) process, trying to select a replacement algorithm
for DES. The first AES conference, held in California the week before Crypto 98, 
had over 250 attendees. The second conference, held in Rome two days before 
FSE 99, had just under 200 attendees. The third AES conference was held in 
conjunction with FSE 2000, during the two days following it, at the same hotel. 

It was a great pleasure for me to organize and chair FSE 2000. We received 53 
submissions covering the broad spectrum of classical encryption research. Each of 
those submissions was read by at least three committee members - more in some 
cases. The committee chose 21 papers to be presented at the workshop. Those 
papers were distributed to workshop attendees in a preproceedings volume. After 
the workshop, authors were encouraged to further improve their papers based 
on comments received. The final result is the proceedings volume you hold in 
your hand. 

To conclude, I would like to thank all the authors who submitted papers
to this conference, whether or not your papers were accepted. It is your continued
research that makes this field a vibrant and interesting one. I would like
to thank the other program committee members: Ross Anderson (Cambridge),
Eli Biham (Technion), Don Coppersmith (IBM), Cunsheng Ding (Singapore),
Dieter Gollmann (Microsoft), Lars Knudsen (Bergen), James Massey (Lund),
Mitsuru Matsui (Mitsubishi), Bart Preneel (K.U. Leuven), and Serge Vaudenay
(EPFL). They performed the hard - and too often thankless - task of selecting
the program. I'd like to thank my assistant, Beth Friedman, who handled
administrative matters for the conference. And I would like to thank the attendees
for coming to listen, learn, share ideas, and participate in the community. I
believe that FSE represents the most interesting subgenre within cryptography,
and that this conference represents the best of what cryptography has to offer.

Enjoy the proceedings, and I'll see everyone next year in Japan.



August 2000 



Bruce Schneier 
Real Time Cryptanalysis of A5/1 on a PC



Alex Biryukov (1), Adi Shamir (1), and David Wagner (2)

(1) Computer Science department, The Weizmann Institute, Rehovot 76100, Israel
(2) Computer Science department, University of California, Berkeley CA 94720, USA.



Abstract. A5/1 is the strong version of the encryption algorithm used
by about 130 million GSM customers in Europe to protect the over-the-air
privacy of their cellular voice and data communication. The best
published attacks against it require between 2^40 and 2^45 steps. This
level of security makes it vulnerable to hardware-based attacks by large
organizations, but not to software-based attacks on multiple targets by
hackers.

In this paper we describe new attacks on A5/1, which are based on subtle
flaws in the tap structure of the registers, their noninvertible clocking
mechanism, and their frequent resets. After a 2^48 parallelizable data
preparation stage (which has to be carried out only once), the actual
attacks can be carried out in real time on a single PC.

The first attack requires the output of the A5/1 algorithm during the
first two minutes of the conversation, and computes the key in about
one second. The second attack requires the output of the A5/1 algorithm
during about two seconds of the conversation, and computes the
key in several minutes. The two attacks are related, but use different
types of time-memory tradeoffs. The attacks were verified with actual
implementations, except for the preprocessing stage which was extensively
sampled rather than completely executed.

REMARK: We based our attack on the version of the algorithm which
was derived by reverse engineering an actual GSM telephone and
published at http://www.scard.org. We would like to thank the GSM
organization for graciously confirming to us the correctness of this
unofficial description. In addition, we would like to stress that this paper
considers the narrow issue of the cryptographic strength of A5/1, and
not the broader issue of the practical security of fielded GSM systems,
about which we make no claims.



1 Introduction 

The over-the-air privacy of GSM telephone conversations is protected by the A5 
stream cipher. This algorithm has two main variants: The stronger A5/1 version 
is used by about 130 million customers in Europe, while the weaker A5/2 version 
is used by another 100 million customers in other markets. The approximate 
design of A5/1 was leaked in 1994, and the exact design of both A5/1 and A5/2 
was reverse engineered by Briceno from an actual GSM telephone in 1999 (see 
0 ). 

B. Schneier (Ed.): FSE 2000, LNCS 1978, pp. 1-1171 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001
In this paper we develop two new cryptanalytic attacks on A5/1, in which
a single PC can extract the conversation key in real time from a small amount
of generated output. The attacks are related, but each one of them optimizes
a different parameter: The first attack (called the biased birthday attack)
requires two minutes of data and one second of processing time, whereas the
second attack (called the random subgraph attack) requires two seconds
of data and several minutes of processing time. There are many possible choices
of tradeoff parameters in these attacks, and three of them are summarized in
Table 1.



Table 1. Three possible tradeoff points in the attacks on A5/1. 



Attack Type                  Preprocessing   Available   Number of     Attack time
                             steps           data        73GB disks

Biased Birthday attack (1)   2^42            2 minutes   4             1 second
Biased Birthday attack (2)   2^48            2 minutes   2             1 second
Random Subgraph attack       2^48            2 seconds   4             minutes



Many of the ideas in these two new attacks are applicable to other stream 
ciphers as well, and define new quantifiable measures of security. 

The paper is organized in the following way: Section 2 contains a full description
of the A5/1 algorithm. Previous attacks on A5/1 are surveyed in Section
3, and an informal description of the new attacks is contained in Section 4.
Finally, Section 5 contains various implementation details and an analysis of the
expected success rate of the attacks, based on large scale sampling with actual
implementations.



2 Description of the A5/1 Stream Cipher 

A GSM conversation is sent as a sequence of frames, one every 4.6 milliseconds. Each
frame contains 114 bits representing the digitized A to B communication, and
114 bits representing the digitized B to A communication. Each conversation
can be encrypted by a new session key K. For each frame, K is mixed with a
publicly known frame counter F_n, and the result serves as the initial state of a
generator which produces 228 pseudo random bits. These bits are XORed by
the two parties with the 114+114 bits of the plaintext to produce the 114+114
bits of the ciphertext.

A5/1 is built from three short linear feedback shift registers (LFSRs) of lengths
19, 22, and 23 bits, which are denoted by R1, R2, and R3 respectively. The
rightmost bit in each register is labelled as bit zero. The taps of R1 are at bit
positions 13, 16, 17, 18; the taps of R2 are at bit positions 20, 21; and the taps of
R3 are at bit positions 7, 20, 21, 22 (see Figure 1). When a register is clocked,
its taps are XORed together, and the result is stored in the rightmost bit of
the left-shifted register. The three registers are maximal length LFSRs with
periods 2^19 - 1, 2^22 - 1, and 2^23 - 1, respectively. They are clocked in a stop/go
fashion using the following majority rule: Each register has a single "clocking"
tap (bit 8 for R1, bit 10 for R2, and bit 10 for R3); each clock cycle, the
majority function of the clocking taps is calculated and only those registers whose
clocking taps agree with the majority bit are actually clocked. Note that at each
step either two or three registers are clocked, and that each register moves with
probability 3/4 and stops with probability 1/4.




m = Majority(C1, C2, C3)

Fig. 1. The A5/1 stream cipher.
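The claims about the majority rule (two or three registers move at each step, and each register moves with probability 3/4) can be checked by enumerating the eight possible clocking-tap combinations. The following is a minimal sketch; the function names are illustrative and the three clocking taps are assumed to be independent, uniformly distributed bits.

```python
from fractions import Fraction
from itertools import product

def majority(a, b, c):
    """Majority of three bits."""
    return (a & b) | (a & c) | (b & c)

def clocked_registers(c1, c2, c3):
    """Return a 0/1 tuple marking which registers move for the given clocking taps."""
    m = majority(c1, c2, c3)
    return tuple(int(c == m) for c in (c1, c2, c3))

# Enumerate all 8 equally likely clocking-tap combinations.
moves = [clocked_registers(*taps) for taps in product((0, 1), repeat=3)]

# How many registers can move in one step, and how often each register moves.
registers_per_step = sorted({sum(m) for m in moves})
move_probability = [Fraction(sum(m[r] for m in moves), len(moves)) for r in range(3)]

print(registers_per_step)  # [2, 3]
print(move_probability)    # [Fraction(3, 4), Fraction(3, 4), Fraction(3, 4)]
```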



The process of generating pseudo random bits from the session key K and
the frame counter F_n is carried out in four steps:

- The three registers are zeroed, and then clocked for 64 cycles (ignoring the
stop/go clock control). During this period each bit of K (from lsb to msb)
is XORed in parallel into the lsb's of the three registers.

- The three registers are clocked for 22 additional cycles (ignoring the stop/go
clock control). During this period the successive bits of F_n (from lsb to msb)
are again XORed in parallel into the lsb's of the three registers. The contents
of the three registers at the end of this step are called the initial state of the
frame.

- The three registers are clocked for 100 additional clock cycles with the
stop/go clock control but without producing any outputs.

- The three registers are clocked for 228 additional clock cycles with the
stop/go clock control in order to produce the 228 output bits. At each clock
cycle, one output bit is produced as the XOR of the msb's of the three
registers.
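The four steps above can be sketched in a few lines of Python. This is an illustrative reading of the description in this section, not a reference implementation: the key and frame counter are assumed to be passed as integers whose bit i is the i-th bit from the lsb, and each output bit is assumed to be taken after the register movement of its clock cycle. All function names are hypothetical.

```python
LENGTHS = (19, 22, 23)                                   # R1, R2, R3
FEEDBACK_TAPS = ((13, 16, 17, 18), (20, 21), (7, 20, 21, 22))
CLOCK_TAPS = (8, 10, 10)

def shift(reg, r, in_bit=0):
    """Clock register r once: XOR its taps (and in_bit) into the new rightmost bit."""
    fb = in_bit
    for t in FEEDBACK_TAPS[r]:
        fb ^= (reg >> t) & 1
    return ((reg << 1) & ((1 << LENGTHS[r]) - 1)) | fb

def clock_all(regs, in_bit):
    """Clock all three registers, ignoring the stop/go control."""
    return [shift(regs[r], r, in_bit) for r in range(3)]

def clock_majority(regs):
    """Clock only the registers whose clocking tap agrees with the majority."""
    taps = [(regs[r] >> CLOCK_TAPS[r]) & 1 for r in range(3)]
    m = int(sum(taps) >= 2)
    return [shift(regs[r], r) if taps[r] == m else regs[r] for r in range(3)]

def initial_state(key, frame):
    """Steps 1 and 2: load the 64 key bits, then the 22 frame counter bits."""
    regs = [0, 0, 0]
    for i in range(64):
        regs = clock_all(regs, (key >> i) & 1)
    for i in range(22):
        regs = clock_all(regs, (frame >> i) & 1)
    return regs

def keystream(key, frame):
    """Steps 3 and 4: 100 silent cycles, then 228 output bits."""
    regs = initial_state(key, frame)
    for _ in range(100):
        regs = clock_majority(regs)
    out = []
    for _ in range(228):
        regs = clock_majority(regs)
        msbs = [(regs[r] >> (LENGTHS[r] - 1)) & 1 for r in range(3)]
        out.append(msbs[0] ^ msbs[1] ^ msbs[2])
    return out

bits = keystream(0x0123456789ABCDEF, 0x134)
print(len(bits))  # 228
```

Bit ordering and the exact moment at which the output is sampled vary between published implementations, so the sketch should be compared against a trusted test vector before being relied upon.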

3 Previous Attacks 

The attacker is assumed to know some pseudo random bits generated by A5/1
in some of the frames. This is the standard assumption in the cryptanalysis of
stream ciphers, and we do not consider in this paper the crucial issue of how
one can obtain these bits in fielded GSM systems. For the sake of simplicity, we
assume that the attacker has complete knowledge of the outputs of the A5/1
algorithm during some initial period of the conversation, and his goal is to find the
key in order to decrypt the remaining part of the conversation. Since GSM
telephones send a new frame every 4.6 milliseconds, each second of the conversation
contains about 2^8 frames.

At the rump session of Crypto 99, Ian Goldberg and David Wagner announced
an attack on A5/2 which requires very few pseudo random bits and just
O(2^16) steps. This demonstrated that the "export version" A5/2 is totally
insecure.

The security of the A5/1 encryption algorithm was analyzed in several papers. 
Some of them are based on the early imprecise description of this algorithm, 
and thus their details have to be slightly modified. The known attacks can be 
summarized in the following way: 

- Briceno found out that in all the deployed versions of the A5/1 algorithm,
the 10 least significant of the 64 key bits were always set to zero. The
complexity of exhaustive search is thus reduced to O(2^54) (see footnote 1).

- Anderson and Roe proposed an attack based on guessing the 41 bits in
the shorter R1 and R2 registers, and deriving the 23 bits of the longer R3
register from the output. However, they occasionally have to guess additional
bits to determine the majority-based clocking sequence, and thus the total
complexity of the attack is about O(2^45). Assuming that a standard PC can
test ten million guesses per second, this attack needs more than a month to
find one key.

- Golic [3] described an improved attack which requires O(2^40) steps. However,
each operation in this attack is much more complicated, since it is based on
the solution of a system of linear equations. In practice, this algorithm is not
likely to be faster than the previous attack on a PC.



1 Our new attack is not based on this assumption, and is thus applicable to A5/1 
implementations with full 64 bit keys. It is an interesting open problem whether we 
can speed it up by assuming that 10 key bits are zero. 
- Golic [3] describes a general time-memory tradeoff attack on stream ciphers
(which was independently discovered by Babbage two years earlier), and
concludes that it is possible to find the A5/1 key in 2^22 probes into random
locations in a precomputed table with 2^42 128-bit entries. Since such a table
requires a 64 terabyte hard disk, the space requirement is unrealistic.
Alternatively, it is possible to reduce the space requirement to 862 gigabytes,
but then the number of probes increases to O(2^28). Since random access to
the fastest commercially available PC disks requires about 6 milliseconds,
the total probing time is almost three weeks. In addition, this tradeoff point
can only be used to attack GSM phone conversations which last more than
3 hours, which again makes it unrealistic.
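The time and space figures quoted in the list above follow from simple arithmetic, reproduced here as a short sketch. The rates (ten million guesses per second, 6 milliseconds per random disk access) are those stated in the text; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Anderson-Roe: about 2^45 guesses at ten million guesses per second.
anderson_roe_days = 2**45 / 10_000_000 / SECONDS_PER_DAY

# Golic tradeoff, first point: 2^42 entries of 128 bits (16 bytes) each.
table_bytes = 2**42 * (128 // 8)
table_tib = table_bytes / 2**40

# Golic tradeoff, second point: 2^28 random disk probes at about 6 ms each.
probe_days = 2**28 * 0.006 / SECONDS_PER_DAY

print(f"{anderson_roe_days:.1f} days")  # 40.7 days ("more than a month")
print(f"{table_tib:.0f} TiB")           # 64 TiB
print(f"{probe_days:.1f} days")         # 18.6 days ("almost three weeks")
```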



4 Informal Description of the New Attacks 

We start with an executive summary of the key ideas of the two attacks. More 
technical descriptions of the various steps will be provided in the next section. 

Key idea 1: Use the Golic time-memory tradeoff. The starting point 
for the new attacks is the time-memory tradeoff described in Golic [3], which is 
applicable to any cryptosystem with a relatively small number of internal states. 
A5/1 has this weakness, since it has n = 2^64 states defined by the 19+22+23 = 64
bits in its three shift registers. The basic idea of the Golic time-memory tradeoff 
is to keep a large set A of precomputed states on a hard disk, and to consider the 
large set B of states through which the algorithm progresses during the actual 
generation of output bits. Any intersection between A and B will enable us to 
identify an actual state of the algorithm from stored information. 

Key idea 2: Identify states by prefixes of their output sequences. 
Each state defines an infinite sequence of output bits produced when we start 
clocking the algorithm from that state. In the other direction, states are usually 
uniquely defined by the first log(n) bits in their output sequences, and thus 
we can look for equality between unknown states by comparing such prefixes 
of their output sequences. During precomputation, pick a subset A of states, 
compute their output prefixes, and store the (prefix, state) pairs sorted into 
increasing prefix values. Given actual outputs of the A5/1 algorithm, extract 
all their (partially overlapping) prefixes, and define B as the set of their corresponding
(unknown) states. Searching for common states in A and B can be
efficiently done by probing the sorted data A on the hard disk with prefix queries 
from B. 
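Key ideas 1 and 2 can be illustrated on a toy scale. The sketch below replaces A5/1 with a hypothetical 16-bit LFSR (so n = 2^16 and prefixes are 16 bits long), keeps the precomputed set A as a list of (prefix, state) pairs sorted by prefix, and probes it with every overlapping prefix of an observed output stream. The generator, the table layout, and all names are assumptions made for illustration only.

```python
import bisect

K = 16  # prefix length, log2 of the toy state space

def step(s):
    """One step of a toy 16-bit LFSR (a stand-in generator, not A5/1)."""
    bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
    return (s >> 1) | (bit << 15)

def output_bits(s, n):
    """First n output bits produced when clocking from state s."""
    bits = []
    for _ in range(n):
        bits.append(s & 1)
        s = step(s)
    return bits

def pack(bits):
    """Pack a list of bits into an integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def build_table(states):
    """Precomputation: (prefix, state) pairs sorted by increasing prefix."""
    return sorted((pack(output_bits(s, K)), s) for s in states)

def probe(table, stream):
    """Query the sorted table with every overlapping K-bit prefix of the stream."""
    hits = []
    for pos in range(len(stream) - K + 1):
        p = pack(stream[pos:pos + K])
        idx = bisect.bisect_left(table, (p, -1))
        while idx < len(table) and table[idx][0] == p:
            hits.append((pos, table[idx][1]))
            idx += 1
    return hits

# Usage: a table that happens to contain a state 50 steps along the data path.
secret = 0xACE1
on_path = secret
for _ in range(50):
    on_path = step(on_path)
table = build_table([0x1234, on_path, 0xBEEF])
hits = probe(table, output_bits(secret, 200))
print((50, on_path) in hits)  # True
```

In the real attack the table lives on disk and each probe is a random disk access, which is why the later key ideas concentrate on reducing the number of probes.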

Key idea 3: A5/1 can be efficiently inverted. As observed by Golic, 
the state transition function of A5/1 is not uniquely invertible: The majority 
clock control rule implies that up to 4 states can converge to a common state 
in one clock cycle, and some states have no predecessors. We can run A5/1 
backwards by exploring the tree of possible predecessor states, and backtracking 
from dead ends. The average number of predecessors of each node is 1, and thus 
the expected number of vertices in the first k levels of each tree grows only 
linearly in k (see [3]). As a result, if we find a common state in the disk and data, 
we can obtain a small number of candidates for the initial state of the frame.
The weakness we exploit here is that due to the frequent reinitializations there 
is a very short distance from intermediate states to initial states. 
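Running A5/1 backwards by one clock cycle can be sketched as follows. Each individual register step is invertible, so for each of the four possible movement patterns (all three registers, or any two of them) there is exactly one candidate predecessor; a candidate is kept only if clocking it forward reproduces the given state. The register conventions follow Section 2; the function names are illustrative.

```python
LENGTHS = (19, 22, 23)
FEEDBACK_TAPS = ((13, 16, 17, 18), (20, 21), (7, 20, 21, 22))
CLOCK_TAPS = (8, 10, 10)
MOVE_PATTERNS = ((1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1))

def bit(x, i):
    return (x >> i) & 1

def shift(reg, r):
    """Clock register r forward once."""
    fb = 0
    for t in FEEDBACK_TAPS[r]:
        fb ^= bit(reg, t)
    return ((reg << 1) & ((1 << LENGTHS[r]) - 1)) | fb

def unshift(reg, r):
    """Undo one clocking of register r (a single LFSR step is invertible)."""
    top = LENGTHS[r] - 1
    prev = reg >> 1            # previous bits 0..top-1
    msb = bit(reg, 0)          # feedback bit = XOR of the previous tap bits
    for t in FEEDBACK_TAPS[r]:
        if t != top:
            msb ^= bit(prev, t)
    return prev | (msb << top)

def forward(state):
    """One majority-controlled clock cycle on a (R1, R2, R3) tuple."""
    taps = [bit(state[r], CLOCK_TAPS[r]) for r in range(3)]
    m = int(sum(taps) >= 2)
    return tuple(shift(state[r], r) if taps[r] == m else state[r] for r in range(3))

def predecessors(state):
    """All states that reach `state` in one clock cycle (between 0 and 4 of them)."""
    state = tuple(state)
    result = []
    for moved in MOVE_PATTERNS:
        cand = tuple(unshift(state[r], r) if moved[r] else state[r] for r in range(3))
        if forward(cand) == state and cand not in result:
            result.append(cand)
    return result
```

Applying `predecessors` repeatedly, and backtracking whenever it returns an empty list, gives the tree exploration described above.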

Key idea 4: The key can be extracted from the initial state of any 
frame. Here we exploit the weakness of the A5/1 key setup routine. Assume that 
we know the state of A5/1 immediately after the key and frame counter were 
used, and before the 100 mixing steps. By running backwards, we can eliminate 
the effect of the known frame counter in a unique way, and obtain 64 linear 
combinations of the 64 key bits. Since the tree exploration may suggest several 
keys, we can choose the correct one by mixing it with the next frame counter, 
running A5/1 forward for more than 100 steps, and comparing the results with 
the actual data in the next frame. 

Key idea 5: The Golic attack on A5/1 is marginally impractical. 

By the well known birthday paradox, A and B are likely to have a common
state when their sizes a and b satisfy a * b ≈ n. We would like a to be bounded
by the size of commercially available PC hard disks, and b to be bounded by
the number of overlapping prefixes in a typical GSM telephone conversation.
Reasonable bounds on these values (justified later in this paper) are a ≈ 2^35 and
b ≈ 2^22. Their product is 2^57, which is about 100 times smaller than n = 2^64. To
make the intersection likely, we either have to increase the storage requirement 
from 150 gigabytes to 15 terabytes, or to increase the length of the conversation 
from two minutes to three hours. Neither approach seems to be practical, but the 
gap is not huge and a relatively modest improvement by two orders of magnitude 
is all we need to make it practical. 

Key idea 6: Use special states. An important consideration in implementing
time-memory tradeoff attacks is that access to disk is about a million
times slower than a computational step, and thus it is crucial to minimize the
number of times we look for data on the hard disk. An old idea due to Ron
Rivest is to keep on the disk only special states which are guaranteed to produce
output bits starting with a particular pattern α of length k, and to access the
disk only when we encounter such a prefix in the data. This reduces the number
b of disk probes by a factor of about 2^k. The number of points a we have to
memorize remains unchanged, since in the formula a * b ≈ n both b and n are
reduced by the same factor 2^k. The downside is that we have to work 2^k times
harder during the preprocessing stage, since only 2^-k of the random states we
try produce outputs with such a k bit prefix. If we try to reduce the number of
disk access steps in the time memory attack on A5/1 from 2^22 to 2^6, we have
to increase the preprocessing time by a factor of about 64,000, which makes it
impractically long.

Key idea 7: Special states can be efficiently sampled in A5/1. A
major weakness of A5/1 which we exploit in both attacks is that it is easy
to generate all the states which produce output sequences that start with a
particular k-bit pattern α with k = 16 without trying and discarding other states.
This is due to a poor choice of the clocking taps, which makes the register bits
that affect the clock control and the register bits that affect the output unrelated
for about 16 clock cycles, so we can choose them independently. This easy access
to special states does not happen in good block ciphers, but can happen in stream 
ciphers due to their simpler transition functions. In fact, the maximal value of 
k for which special states can be sampled without trial and error can serve as a 
new security measure for stream ciphers, which we call its sampling resistance. 
As demonstrated in this paper, high values of k can have a big impact on the 
efficiency of time-memory tradeoff attacks on such cryptosystems. 

Key idea 8: Use biased birthday attacks. The main idea of the first 
attack is to consider sets A and B which are not chosen with uniform probability 
distribution among all the possible states. Assume that each state s is chosen 
for A with probability P_A(s), and is chosen for B with probability P_B(s). If the
means of these probability distributions are a/n and b/n, respectively, then the
expected size of A is a, and the expected size of B is b.

The birthday threshold happens when the sum of P_A(s)P_B(s) over all states s
is about 1. For independent
uniform distributions, this evaluates to the standard condition a * b ≈ n. However,
in the new attack we choose states for the disk and states in the data with 
two non-uniform probability distributions which have strong positive correlation. 
This makes our time memory tradeoff much more efficient than the one used by 
Golic. This is made possible by the fact that in A5/1, the initial state of each 
new frame is rerandomized very frequently with different frame counters. 

Key idea 9: Use Hellman's time-memory tradeoff on a subgraph of
special states. The main idea of the second attack (called the random subgraph
attack) is to make most of the special states accessible by simple computations
from the subset of special states which are actually stored in the hard disk. The
first occurrence of a special state in the data is likely to happen in the first two
seconds of the conversation, and this single occurrence suffices in order to locate
a related special state in the disk even though we are well below the threshold
of either the normal or the biased birthday attack. The attack is based on a new
function f which maps one special state into another special state in an easily
computable way. This f can be viewed as a random function over the subspace
of 2^48 special states, and thus we can use Hellman's time-memory tradeoff [4] in
order to invert it efficiently. The inverse function enables us to compute special
states from output prefixes even when they are not actually stored on the hard
disk, with various combinations of time T and memory M satisfying M√T = 2^48.
If we choose M = 2^36, we get T = 2^24, and thus we can carry out the attack
in a few minutes, after a 2^48 preprocessing stage which explores the structure of
this function f.

Key idea 10: A5/1 is very efficient on a PC. The A5/1 algorithm was
designed to be efficient in hardware, and its straightforward software
implementation is quite slow. To execute the preprocessing stage, we have to run it on
a distributed network of PCs up to 2^48 times, and thus we need an extremely
efficient way to compute the effect of one clock cycle on the three registers.

We exploit the following weakness in the design of A5/1: Each one of the 
three shift registers is so small that we can precompute all its possible states, 
and keep them in RAM as three cyclic arrays, where successive locations in each 
array represent successive states of the corresponding shift register. In fact, we
don't have to keep the full states in the arrays, since the only information we
have to know about a state is its clocking tap and its output tap. A state can
thus be viewed as a triplet of indices (i, j, k) into three large single bit arrays
(see Figure 2). A_1(i), A_2(j), A_3(k) are the clocking taps of the current state, and
A_1(i - 11), A_2(j - 12), A_3(k - 13) are the output taps of the current state (since
these are the corresponding delays in the movement of clocking taps to output
taps when each one of the three registers is clocked). Since there is no mixing of
the values of the three registers, their only interaction is in determining which
of the three indices should be incremented by 1. This can be determined by a
precomputed table with three input bits (the clocking taps) and three output
bits (the increments of the three registers). When we clock A5/1 in our software
implementation, we don't shift registers or compute feedbacks; we just add a
0/1 vector to the current triplet of indices. A typical two dimensional variant
of such movement vectors in triplet space is described in Figure 3. Note the
local tree structure determined by the deterministic forward evaluation and the
nondeterministic backward exploration in this triplet representation.

Since the increment table is so small, we can expand the A tables from bits to 
bytes, and use a larger precomputed table with 2^24 entries, whose inputs are the
three bytes to the right of the clocking taps in the three registers, and outputs 
are the three increments to the indices which allow us to jump directly to the 
state which is 8 clock cycles away. The total amount of RAM needed for the 
state arrays and precomputed movement tables is less than 128 MB, and the 
total cost of advancing the three registers for 8 clock cycles is one table lookup 
and three integer additions! A similar table lookup technique can be used to 
compute in a single step output bytes instead of output bits, and to speed up 
the process of running A5/1 backwards. 
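The triplet representation can be sketched on scaled-down registers. The toy lengths, feedback taps, and clocking taps below are hypothetical and chosen only to keep the arrays tiny; with the A5/1 parameters of Section 2 the three arrays would have 2^19 - 1, 2^22 - 1, and 2^23 - 1 entries. In this sketch the output offsets are computed directly from the bit positions (msb position minus clocking tap position), which may differ by one from the delays quoted in the text depending on the indexing convention used.

```python
from itertools import product

# Hypothetical toy parameters; A5/1 itself uses lengths (19, 22, 23), the
# feedback taps of Section 2, and clocking taps (8, 10, 10).
LENGTHS = (5, 6, 7)
FEEDBACK_TAPS = ((2, 4), (4, 5), (5, 6))
CLOCK_TAPS = (2, 3, 3)

def shift(reg, r):
    """Clock register r forward once."""
    fb = 0
    for t in FEEDBACK_TAPS[r]:
        fb ^= (reg >> t) & 1
    return ((reg << 1) & ((1 << LENGTHS[r]) - 1)) | fb

def cycle_states(r, start=1):
    """Successive states of register r, from start until the cycle closes."""
    states, s = [], start
    while True:
        states.append(s)
        s = shift(s, r)
        if s == start:
            return states

CYCLES = [cycle_states(r) for r in range(3)]
# A[r][i]: clocking tap of the i-th state on the cycle of register r.
A = [bytes((s >> CLOCK_TAPS[r]) & 1 for s in CYCLES[r]) for r in range(3)]
# Number of shifts a bit needs to travel from the clocking tap to the msb.
DELAY = [LENGTHS[r] - 1 - CLOCK_TAPS[r] for r in range(3)]
# Increment table: three clocking taps -> 0/1 movement vector.
INCREMENT = {c: tuple(int(x == (sum(c) >= 2)) for x in c)
             for c in product((0, 1), repeat=3)}

def clock_triplet(idx):
    """Advance one clock cycle by adding a 0/1 vector to the index triplet."""
    inc = INCREMENT[tuple(A[r][idx[r]] for r in range(3))]
    return tuple((idx[r] + inc[r]) % len(A[r]) for r in range(3))

def output_bit(idx):
    """XOR of the three msb's, read from the clocking-tap arrays at an offset."""
    out = 0
    for r in range(3):
        out ^= A[r][(idx[r] - DELAY[r]) % len(A[r])]
    return out
```

The byte-wide tables with 2^24 entries described next extend the same idea so that one lookup advances the state by 8 clock cycles.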




[Figure: the indices i, j, and k point into three cyclic arrays, one each for R1, R2, and R3.]

Fig. 2. Triplet representation of a state.
[Figure: the four possible movement vectors from a state (i, j, k), leading to (i, j+1, k+1), (i+1, j, k+1), (i+1, j+1, k), or (i+1, j+1, k+1).]



Fig. 3. The state-transition graph in the triplet representation of A5/1. 



5 Detailed Description of the Attacks 

In this section we fill in the missing details, and analyse the success rate of the 
new attacks. 

5.1 Efficient Sampling of Special States 

Let α be any 16 bit pattern of bits. To simplify the analysis, we prefer to use an
α which does not coincide with shifted versions of itself (such as α = 1000...0)
since this makes it very unlikely that a single 228-bit frame contains more than
one occurrence of α.

The total number of states which generate an output prefix of α is about
2^64 * 2^-16 = 2^48. We would like to generate all of them in a (barely doable)
2^48 preprocessing stage, without trying all the 2^64 possible states and discarding
the vast majority which fail the test. The low sampling resistance of A5/1 is
made possible by several flaws in its design, which are exploited in the following
algorithm:

- Pick an arbitrary 19-bit value for the shortest register R1. Pick arbitrary
values for the rightmost 11 bits in R2 and R3 which will enter the clock
control taps in the next few cycles. We can thus define 2^(19+11+11) = 2^41
partial states.

- For each partial state we can uniquely determine the clock control of the
three registers for the next few cycles, and thus determine the identity of the
bits that enter their msb's and affect the output.

- Due to the majority clock control, at least one of R2 and R3 shifts a new
(still unspecified) bit into its msb at each clock cycle, and thus we can make
sure that the computed output bit has the desired value. Note that about
half the time only one new bit is shifted (and then its choice is forced), and
about half the time two new bits are shifted (and then we can choose them in
two possible ways). We can keep this process alive without time consuming
trial and error as long as the clock control taps contain only known bits
whereas the output taps contain at least one unknown bit. A5/1 makes this
very easy, by using a single clocking tap and placing it in the middle of each
register: We can place in R2 and R3 11 specified bits to the right of the clock
control tap, and 11-12 unspecified bits to the right of the output tap. Since
each register moves only 3/4 of the time, we can keep this process alive for
about 16 clock cycles, as desired.

- This process generates only special states, and cannot miss any special state
(if we start the process with its partial specification, we cannot get into
an early contradiction). We can similarly generate any number c < 2^48 of
randomly chosen special states in time proportional to c. As explained later
in the paper, this can make the preprocessing faster, at the expense of other
parameters in our attack.



5.2 Efficient Disk Probing 

To leave room for a sufficiently long identifying prefix of 35 bits after the 16-bit α,
we allow it to start only at bit positions 1 to 177 in each one of the given frames
(i.e., at a distance of 101 to 277 from the initial state). The expected number of
occurrences of α in the data produced by A5/1 during a two minute conversation
is thus 2^-16 * 177 * 120 * 1000/4.6 ≈ 71. This is the expected number of times
b we access the hard disk. Since each random access takes about 6 milliseconds,
the total disk access time becomes negligible (about 0.4 seconds).
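The estimate above can be reproduced numerically. The sketch below uses only the figures given in the text (a frame every 4.6 milliseconds, 177 admissible starting positions, a 16-bit pattern, about 6 milliseconds per random disk access); the variable names are illustrative.

```python
FRAME_MS = 4.6         # one frame every 4.6 milliseconds
POSITIONS = 177        # allowed starting positions of the pattern per frame
PATTERN_BITS = 16      # length of the pattern
SEEK_SECONDS = 0.006   # about 6 ms per random disk access

frames = 120 * 1000 / FRAME_MS                         # frames in two minutes
expected_hits = 2**-PATTERN_BITS * POSITIONS * frames  # expected disk probes b
disk_seconds = expected_hits * SEEK_SECONDS

print(round(frames))           # 26087
print(f"{expected_hits:.1f}")  # 70.5, which the text rounds to 71
print(f"{disk_seconds:.2f}")   # 0.42
```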

5.3 Efficient Disk Storage 

The data items we store on the disk are (prefix, state) pairs. The state of A5/1 
contains 64 bits, but we keep only special states and thus we can encode them 
efficiently with shorter 48 bit names, by specifying the 41 bits of the partial state 
and the ≈ 7 choice bits in the sampling procedure. We can further reduce the
state to less than 40 bits (5 bytes) by leaving some of the 48 bits unspecified. This 
saves a considerable fraction of the disk space prepared during preprocessing, 
and the only penalty is that we have to try a small number of candidate states 
instead of one candidate state for each one of the 71 relevant frames. Since this 
part is so fast, even in its slowed down version it takes less than a second. 

The output prefix produced from each special state is nominally of length
16+35=51 bits. However, the first 16 bits are always the constant α, and the
next 35 bits are stored in sorted order on the disk. We can thus store the full
value of these 35 bits only once per sector, and encode on the disk only their
small increments (with a default value of 1). Other possible implementations are
to use the top parts of the prefixes as direct sector addresses or as file names.
With these optimizations, we can store each one of the sorted (prefix, state)
pairs in just 5 bytes. The largest commercially available PC hard disks (such as
IBM Ultrastar 72 ZX or Seagate Cheetah 73) have 73 gigabytes. By using two
such disks, we can store 146 * 2^30 / 5 ≈ 2^35 pairs during the preprocessing stage,
and characterize each one of them by the (usually unique) 35-bit output prefix
which follows α.
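As a quick check of the storage estimate, the number of 5-byte pairs that fit on two 73 gigabyte disks can be computed directly. The sketch assumes, as the formula in the text does, that a gigabyte is 2^30 bytes.

```python
import math

DISK_GIB = 73        # capacity of one disk, in units of 2^30 bytes
DISKS = 2
BYTES_PER_PAIR = 5   # one sorted (prefix, state) pair

pairs = DISKS * DISK_GIB * 2**30 // BYTES_PER_PAIR
print(f"2^{math.log2(pairs):.2f}")  # 2^34.87, i.e. close to 2^35
```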

5.4 Efficient Tree Exploration 

The forward state-transition function of A5/1 is deterministic, but in the reverse 
direction we have to consider four possible predecessors. About 3/8 of the states 
have no predecessors, 13/32 of the states have one predecessor, 3/32 of the states 
have two predecessors, 3/32 of the states have three predecessors, and 1/32 of 
the states have four predecessors. 
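The predecessor distribution quoted above can be reproduced by exact enumeration. For each register only two bits matter: the bit currently at the clocking tap (which was the clocking bit if the register did not move) and the bit one position to its left (which was the clocking bit if it did move). The sketch below assumes these six bits are independent and uniformly distributed, and counts how many of the four movement patterns are consistent with the majority rule.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

MOVE_PATTERNS = ((1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1))

def count_predecessors(x, y):
    """x[r]: bit now at the clocking tap of register r; y[r]: bit one position
    to its left. Returns the number of consistent movement patterns."""
    count = 0
    for moved in MOVE_PATTERNS:
        # Clocking bits the predecessor would have had under this pattern.
        prev = [y[r] if moved[r] else x[r] for r in range(3)]
        m = int(sum(prev) >= 2)
        # Registers that moved must agree with the majority; the others must not.
        if all((prev[r] == m) == bool(moved[r]) for r in range(3)):
            count += 1
    return count

histogram = Counter(count_predecessors(x, y)
                    for x in product((0, 1), repeat=3)
                    for y in product((0, 1), repeat=3))
distribution = {n: Fraction(histogram[n], 64) for n in range(5)}
average = sum(n * p for n, p in distribution.items())

print(distribution[0], distribution[1], distribution[2],
      distribution[3], distribution[4])  # 3/8 13/32 3/32 3/32 1/32
print(average)                           # 1
```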

Since the average number of predecessors is 1, Golic assumed that a good 
statistical model for the generated trees of predecessors is the critical branching 
process (see [3]). We were surprised to discover that in the case of A5/1, there 
was a very significant difference between the predictions of this model and our 
experimental data. For example, the theory predicted that only 2% of the states
would have some predecessor at depth 100, whereas in a large sample of
100,000,000 trees we generated from random A5/1 states the percentage was 
close to 15%. Another major difference was found in the tail distributions of the 
number of sons at depth 100: Theory predicted that in our sample we should see 
some cases with close to 1000 sons, whereas in our sample we never saw trees 
with more than 120 sons at depth 100. 




[Figure: a tree of states with segments labelled "100 steps" and "177 steps".]



5.5 The Biased Birthday Attack 

To analyse the performance of our biased birthday attack, we introduce the
following notation:

Definition 1 A state s is coloured red, if the sequence of output bits produced
from state s starts with $\alpha$ (i.e., it is a special state). The subspace of all the red
states is denoted by R.

Definition 2 A state is coloured green, if the sequence of output bits produced
from state s contains an occurrence of $\alpha$ which starts somewhere between bit
positions 101 and 277. The subspace of all the green states is denoted by G.

The red states are the states that we keep in the disk, look for in the data, 
and try to collide by comparing their prefixes. The green states are all the states 
that could serve as initial states in frames that contain $\alpha$. Non-green initial states
are of no interest to us, since we discard the frames they generate from the actual 
data. 

The size of R is approximately $2^{48}$, since there are $2^{64}$ possible states, and the
probability that $\alpha$ occurs right at the beginning of the output sequence is $2^{-16}$.
Since the redness of a state is not directly related to its separate coordinates i,
j, k in the triplet space, the red states can be viewed as randomly and sparsely
located in this representation. The size of G is approximately $177 \cdot 2^{48}$ (which is
still a small fraction of the state space) since $\alpha$ has 177 opportunities to occur
along the output sequence.
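
The two size estimates follow directly from the definitions. The short sketch below assumes a 64-bit state, a 16-bit pattern, and a belt covering bit positions 101 to 277 inclusive; the constant names are illustrative.

```python
STATE_BITS = 64                    # size of the A5/1 state, as stated in the text
PATTERN_BITS = 16                  # length of the special output pattern
FIRST_POS, LAST_POS = 101, 277     # positions where the pattern may start in a green state

red_states = 2 ** (STATE_BITS - PATTERN_BITS)   # |R|
positions = LAST_POS - FIRST_POS + 1            # opportunities for the pattern to occur
green_states = positions * red_states           # |G|, ignoring double occurrences
green_fraction = green_states / 2**STATE_BITS

print(f"|R| = 2^48: {red_states == 2**48}")
print(f"positions = {positions}, |G| / 2^64 = {green_fraction:.4%}")
```

The green states make up well under 1% of the state space, which is the "small fraction" the text refers to.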

Since a short path of length 277 in the output sequence is very unlikely to
contain two occurrences of $\alpha$, the relationship between green and red states is
essentially many to one: The set of all the relevant states we consider can be
viewed as a collection of disjoint trees of various sizes, where each tree has a red
state as its root and a “belt” of green states at levels 101 to 277 below it (see
Figure 4). The weight W(s) of a tree whose root is the red state s is defined as
the number of green states in its belt, and s is called k-heavy if $W(s) \ge k$.

The crucial observation which makes our biased birthday attack efficient is 
that in A5/1 there is a huge variance in the weights of the various red states. We 
ran the tree exploration algorithm on 100,000,000 random states and computed 
their weights. We found out that the weight of about 85% of the states was zero, 
because their trees died out before reaching depth 100. Other weights ranged all 
the way from 1 to more than 26,000. 

The leftmost graph of Figure 5 describes for each x which is a multiple of 100
the value y which is the total weight of all the trees whose weights were between
x and x + 100. The total area under the graph to the right of x = k represents
the total number of green states in all the k-heavy trees in our sample.

The initial mixing of the key and frame number, which ignores the usual clock 
control and flips the least significant bits of the registers about half the time 
before shifting them, can be viewed as random jumps with uniform probability 
distribution into new initial states: even a pair of frame counters with Hamming 
distance 1 can lead to far away initial states in the triplet space. When we 
restrict our attention to the frames that contain $\alpha$, we get a uniform probability
distribution over the green states, since only green states can serve as initial 
states in such frames. 
The red states, on the other hand, are not encountered with uniform probability
distribution in the actual data. For example, a red state whose tree has
no green belt will never be seen in the data. On the other hand, a red state
with a huge green belt has a huge number of chances to be reached when the
green initial state is chosen with uniform probability distribution. In fact the
probability of encountering a particular red state s in a particular frame which
is known to contain $\alpha$ is the ratio of its weight W(s) and the total number of
green states $177 \cdot 2^{48}$, and the probability of encountering it in one of the 71
relevant frames is $P_B(s) = 71 \cdot W(s)/(177 \cdot 2^{48})$.

Since $P_B(s)$ has a huge variance, we can maximize the expected number of
collisions $\sum_s P_A(s) \cdot P_B(s)$ by choosing red points for the hard disk not with
uniform probability distribution, but with a biased probability $P_A(s)$ which
maximizes the correlation between these distributions, while minimizing the
expected size of A. The best way to do this is to keep on the disk only the heaviest
trees. In other words, we choose a threshold number k, and define $P_A(s) = 0$ if
$W(s) < k$, and $P_A(s) = 1$ if $W(s) \ge k$. We can now easily compute the expected
number of collisions by the formula:

$$\sum_s P_A(s) \cdot P_B(s) = \sum_{s \,:\, W(s) \ge k} \frac{71 \cdot W(s)}{177 \cdot 2^{48}}$$

which is just the number of red states we keep on the disk, times the average
weight of their trees, times $71/(177 \cdot 2^{48})$.

In our actual attack, we keep $2^{35}$ red states on the disk. This is a $2^{-13}$ fraction
of the $2^{48}$ red states. With such a tiny fraction, we can choose particularly heavy
trees with an average weight of 12,500. The expected number of colliding red
states in the disk and the actual data is
$2^{35} \cdot 12{,}500 \cdot 71/(177 \cdot 2^{48}) \approx 0.61$. This
expected value makes it quite likely that a collision will actually exist (see footnote 2).
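
The expected number of collisions is a one-line computation from the figures above. The sketch below also converts it into a rough success probability; that last step assumes the number of collisions is approximately Poisson distributed, which is an added assumption and not a claim made in the text.

```python
import math

STORED_RED_STATES = 2**35      # red states kept on disk
AVERAGE_WEIGHT = 12_500        # average weight of the stored (heavy) trees
RELEVANT_FRAMES = 71           # frames in the data that contain the pattern
GREEN_STATES = 177 * 2**48     # total number of green states

expected_collisions = (
    STORED_RED_STATES * AVERAGE_WEIGHT * RELEVANT_FRAMES / GREEN_STATES
)

# Assumption: collisions are roughly Poisson, so P[at least one] = 1 - exp(-mean).
p_at_least_one = 1.0 - math.exp(-expected_collisions)

print(f"expected collisions = {expected_collisions:.3f}")
print(f"P[at least one collision] (Poisson assumption) = {p_at_least_one:.2f}")
```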

The intuition behind the biased time memory tradeoff attack is very simple. 
We store red states, but what we really want to collide are the green states in 
their belts (which are accessible from the red roots by an easy computation). 
The 71 green states in the actual data are uniformly distributed, and thus we 
want to cover about 1% of the green area under the curve in the right side of 
Figure 5. Standard time memory tradeoff attacks store random red states, but
each stored state increases the coverage by just 177 green states on average. With 
our optimized choice in the preprocessing stage, each stored state increases the 
coverage by 12,500 green states on average, which improves the efficiency of the 
attack by almost two orders of magnitude. 
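
Both the 1% coverage target and the efficiency gain can be checked numerically. The sketch below uses only the figures quoted in the text (average weights of 177 and 12,500 green states per stored red state); the variable names are illustrative.

```python
STORED_RED_STATES = 2**35
GREEN_STATES = 177 * 2**48

RANDOM_TREE_WEIGHT = 177       # average green states covered by a random red state
HEAVY_TREE_WEIGHT = 12_500     # average green states covered by a stored heavy tree

coverage = STORED_RED_STATES * HEAVY_TREE_WEIGHT / GREEN_STATES
gain = HEAVY_TREE_WEIGHT / RANDOM_TREE_WEIGHT

print(f"fraction of green states covered = {coverage:.2%}")
print(f"gain over storing random red states = {gain:.1f}x")
```

A gain of about 70 is what the text describes as almost two orders of magnitude.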

5.6 Efficient Determination of Initial States 

One possible disadvantage of storing heavy trees is that once we find a collision,
we have to try a large number of candidate states in the green belt of the colliding
red state. Since each green state is only partially specified in our compact 5-byte
representation, the total number of candidate green states can be hundreds of
thousands, and the real time part of the attack can be relatively slow.

2 Note that in time memory tradeoff attacks, it becomes increasingly expensive to push
this probability towards 1, since the only way to guarantee success is to memorize
the whole state space.

However, this simple estimate is misleading. The parasitic red states obtained
from the partial specification can be quickly discarded by evaluating their
outputs beyond the guaranteed occurrence of $\alpha$ and comparing it to the bits in
the given frame. In addition, we know the exact location of $\alpha$ in this frame, and
thus we know the exact depth of the initial state we are interested in within the
green belt. As a result, we have to try only about 70 states in a cut through the
green belt, and not the 12,500 states in the full belt.

5.7 Reducing the Preprocessing Time of the Biased Birthday Attack

The $2^{48}$ complexity of the preprocessing stage of this attack can make it too
time consuming for a small network of PCs. In this section we show how to
reduce this complexity by any factor of up to 1000, by slightly increasing either 
the space complexity or the length of the attacked conversation. 

The efficient sampling procedure makes it possible to generate any number
$c < 2^{48}$ of random red states in time proportional to c. To store the same number
of states in the disk, we have to choose a larger fraction of the tested trees, which
have a lower average weight, and thus a less efficient coverage of the green states.
Table 2 describes the average weight of the heaviest trees for various fractions of
the red states, which was experimentally derived from our sample of 100,000,000
A5/1 trees. This table can be used to choose the appropriate value of k in the
definition of the k-heavy trees for various choices of c. The implied tradeoff is
very favorable: If we increase the fraction from $2^{-13}$ to $2^{-7}$, we can reduce the
preprocessing time by a factor of 64 (from $2^{48}$ to $2^{42}$), and compensate by either
doubling the length of the attacked conversation from 2 minutes to 4 minutes,
or doubling the number of hard disks from 2 to 4. The extreme point in this
tradeoff is to store in the disk all the sampled red states with nonzero weights
(the other sampled red states are just a waste of space, since they will never
be seen in the actual data). In A5/1 about 15% of the red states have nonzero
weights, and thus we have to sample about $2^{38}$ red states in the preprocessing
stage in order to find the 15% among them (about $2^{35}$ states) which we want
to store, with an average tree weight of 1180. To keep the same probability of
success, we have to attack conversations which last about half an hour.

Table 2. The average weight of the heaviest trees for various fractions of R.

Fraction    Weight     Fraction    Weight     Fraction    Weight     Fraction    Weight
$2^{-4}$      2432     $2^{-5}$      3624     $2^{-6}$      4719     $2^{-7}$      5813
$2^{-8}$      6910     $2^{-9}$      7991     $2^{-10}$     9181     $2^{-11}$    10277
$2^{-12}$    11369     $2^{-13}$    12456     $2^{-14}$    13471     $2^{-15}$    14581
$2^{-16}$    15686     $2^{-17}$    16839     $2^{-18}$    17925     $2^{-19}$    19012
$2^{-20}$    20152     $2^{-21}$    21227     $2^{-22}$    22209     $2^{-23}$    23515
$2^{-24}$    24597     $2^{-25}$    25690     $2^{-26}$    26234
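
The tradeoff read off Table 2 can be tabulated mechanically. The sketch below is one possible reading of it: it assumes $2^{35}$ states are always stored, that the number of sampled red states is $2^{35}$ divided by the kept fraction, and that the extra data (or disk space) needed scales with the ratio of average weights, as the collision formula suggests. The function name `tradeoff` is hypothetical.

```python
# Table 2 of the text: log2(fraction of R kept) -> average weight of the kept trees.
AVERAGE_WEIGHT = {
    -4: 2432, -5: 3624, -6: 4719, -7: 5813,
    -8: 6910, -9: 7991, -10: 9181, -11: 10277,
    -12: 11369, -13: 12456, -14: 13471, -15: 14581,
    -16: 15686, -17: 16839, -18: 17925, -19: 19012,
    -20: 20152, -21: 21227, -22: 22209, -23: 23515,
    -24: 24597, -25: 25690, -26: 26234,
}

STORED_LOG2 = 35       # red states kept on disk (log2)
BASELINE_LOG2 = -13    # fraction used in the attack as described


def tradeoff(fraction_log2):
    """Return (log2 of sampled red states, factor of extra data or disks needed)."""
    sampled_log2 = STORED_LOG2 - fraction_log2
    compensation = AVERAGE_WEIGHT[BASELINE_LOG2] / AVERAGE_WEIGHT[fraction_log2]
    return sampled_log2, compensation


for frac in (-13, -10, -7, -4):
    sampled, factor = tradeoff(frac)
    print(f"fraction 2^{frac}: sample 2^{sampled} states, compensate by x{factor:.2f}")
```

For the fraction $2^{-7}$ the compensation factor is a little above 2, matching the doubling of conversation length or disk count described in the text.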

A further reduction in the complexity of the preprocessing stage can be obtained
by the early abort strategy: Explore each red state to a shallow depth,
and continue to explore only the most promising candidates which have a large
number of sons at that depth. This heuristic does not guarantee the existence
of a large belt, but there is a clear correlation between these events.

To check whether the efficiency of our biased birthday attack depends on the
details of the stream cipher, we ran several experiments with modified variants
of A5/1. In particular, we concentrated on the effect of the clock control rule,
which determines the noninvertibility of the model. For example, we hashed the
full state of the three registers and used the result to choose among the four
possible majority-like movements (+1,+1,+1), (+1,+1,0), (+1,0,+1), (0,+1,+1)
in the triplet space. The results were very different from the real majority rule.
We then replaced the majority rule by a minority rule (if all the clocking taps
agree, all the registers move, otherwise only the minority register moves). The
results of this minority rule were very similar to the majority-like hashing case,
and very different from the real majority case (see Figure 5). It turns out that
in this sense A5/1 is actually stronger than its modified versions, but we do
not currently understand the reason for this strikingly different behavior. We
believe that the type of data in Table 2, which we call the tail coverage of
the cryptosystem, can serve as a new security measure for stream ciphers with
noninvertible state transition functions.
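
The two clock-control rules compared above can be written down compactly. The sketch below assumes the usual description of A5/1 clocking, in which a register moves exactly when its clocking tap agrees with the majority of the three taps; the minority rule follows the parenthetical definition in the text. Function names are illustrative, and only the movement vectors are modelled, not the registers themselves.

```python
from itertools import product


def majority_moves(c1, c2, c3):
    """Movement vector under the majority rule: agreeing registers move."""
    majority = 1 if c1 + c2 + c3 >= 2 else 0
    return tuple(int(c == majority) for c in (c1, c2, c3))


def minority_moves(c1, c2, c3):
    """Movement vector under the minority rule described in the text."""
    if c1 == c2 == c3:
        return (1, 1, 1)          # all taps agree: every register moves
    majority = 1 if c1 + c2 + c3 >= 2 else 0
    return tuple(int(c != majority) for c in (c1, c2, c3))


TAPS = list(product((0, 1), repeat=3))
MAJORITY_MOVEMENTS = {majority_moves(*taps) for taps in TAPS}
MINORITY_MOVEMENTS = {minority_moves(*taps) for taps in TAPS}

print(sorted(MAJORITY_MOVEMENTS))
print(sorted(MINORITY_MOVEMENTS))
```

The majority rule yields exactly the four movements listed in the text, while the minority rule moves either all three registers or a single one.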



5.8 Extracting the Key from a Single Red State 

The biased birthday attack was based on a direct collision between a state in 
the disk and a state in the data, and required $\approx 71$ red states from a relatively
long ($\approx 2$ minute) prefix of the conversation. In the random subgraph attack we
use indirect collisions, which make it possible to find the key with reasonable 
probability from the very first red state we encounter in the data, even though 
it is unlikely to be stored in the disk. This makes it possible to attack A5/1 
with less than two seconds of available data. The actual attack requires several 
minutes instead of one second, but this is still a real time attack on normal 
telephone conversations. 

The attack is based on Hellman’s original time-memory tradeoff for block
ciphers, described in [4]. Let E be an arbitrary block cipher, and let P be some
fixed plaintext. Define the function f from keys K to ciphertexts C by
$f(K) = E_K(P)$. Assuming that all the plaintexts, ciphertexts and keys have the same
binary size, we can consider f as a random function (which is not necessarily
one-to-one) over a common space U. This function is easy to evaluate and to
iterate but difficult to invert, since computing the key K from the ciphertext
$f(K) = E_K(P)$ is essentially the problem of chosen message cryptanalysis.
Hellman’s idea was to perform a precomputation in which we choose a large
number m of random start points in U, and iterate f on each one of them t
times. We store the m (start point, end point) pairs on a large disk, sorted into
increasing endpoint order. If we are given $f(K)$ for some unknown K which is
located somewhere along one of the covered paths, we can recover K by repeatedly
applying f in the easy forward direction until we hit a stored end point,
jump to its corresponding start point, and continue to apply f from there. The
last point before we hit $f(K)$ again is likely to be the key K which corresponds
to the given ciphertext $f(K)$.
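
The precomputation and lookup just described can be sketched on a toy scale. Everything specific in the example below is an assumption made for illustration: the 16-bit space, the use of truncated SHA-256 as the "random" function f, and the names `build_table` and `invert`. It is a single-table version without the rerandomized variants, not the attack on A5/1.

```python
import hashlib
import random

BITS = 16                      # toy space U of 2^16 points (assumption for the demo)
MASK = (1 << BITS) - 1


def f(x):
    """Toy stand-in for K -> E_K(P): truncated SHA-256 of x."""
    digest = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") & MASK


def build_table(m, t, rng):
    """Iterate f t times from m random start points; index chains by end point."""
    table = {}
    for _ in range(m):
        start = rng.randrange(1 << BITS)
        x = start
        for _ in range(t):
            x = f(x)
        table.setdefault(x, []).append(start)
    return table


def invert(y, table, t):
    """Return some x with f(x) == y if y lies on a covered path, else None."""
    z = y
    for _ in range(t):
        for start in table.get(z, []):
            x = start
            for _ in range(t):          # replay the chain from its start point
                if f(x) == y:
                    return x
                x = f(x)
        z = f(z)                        # keep walking forward towards an end point
    return None


rng = random.Random(1)
T_STEPS = 32
table = build_table(64, T_STEPS, rng)

# Pick a key that is known to lie on a covered path, then try to recover it.
key = next(iter(table.values()))[0]
for _ in range(10):
    key = f(key)
target = f(key)
recovered = invert(target, table, T_STEPS)
print(recovered is not None and f(recovered) == target)
```

Because f is not one-to-one, the recovered point is a preimage of the target and need not equal the original key; end points reached from other chains are false alarms, which is why each candidate chain is replayed and checked.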

Since it is difficult to cover a random graph with random paths in an efficient
way, Hellman proposed a rerandomization technique which creates multiple
variants of f (e.g., by permuting the order of the output bits of f). We use t
variants $f_i$, and iterate each one of them t times on m random start points to
get m corresponding end points. If the parameters m and t satisfy $mt^2 = |U|$,
then each state is likely to be covered by one of the variants of f. Since we have
to handle each variant separately (both in the preprocessing and in the actual
attack), the total memory becomes $M = mt$ and the total running time becomes
$T = t^2$, where M and T can be anywhere along the tradeoff curve $M\sqrt{T} = |U|$.
In particular, Hellman suggests using $M = T = |U|^{2/3}$.

A straightforward application of this $M\sqrt{T} = |U|$ tradeoff to the $|U| = 2^{64}$
states of A5/1 with the maximal memory $M = 2^{36}$ requires time $T = 2^{56}$, which
is much worse than previously known attacks. The basic idea of the new random
subgraph attack is to apply the time-memory tradeoff to the subspace R of $2^{48}$
red states, which is made possible by the fact that it can be efficiently sampled.
Since T occurs in the tradeoff formula $M\sqrt{T} = |U|$ with a square root, reducing
the size of the graph by a modest $2^{16}$ (from $|U| = 2^{64}$ to $|R| = 2^{48}$) and using
the same memory ($M = 2^{36}$), reduces the time by a huge factor of $2^{32}$ (from
$T = 2^{56}$ to just $T = 2^{24}$). This number of steps can be carried out in several
minutes on a fast PC.
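
The effect of shrinking the search space is easy to verify by working with exponents. The helper below is an illustrative sketch (the name `tradeoff_time_log2` is hypothetical) that solves $M\sqrt{T} = |U|$ for T, with all quantities expressed as base-2 logarithms.

```python
def tradeoff_time_log2(space_log2, memory_log2):
    """Solve M * sqrt(T) = |U| for T, with all quantities given as log2 values."""
    return 2 * (space_log2 - memory_log2)


full_space_time = tradeoff_time_log2(64, 36)   # all 2^64 states of A5/1
red_space_time = tradeoff_time_log2(48, 36)    # only the 2^48 red states

print(f"T over the full state space: 2^{full_space_time}")
print(f"T over the red states:       2^{red_space_time}")
print(f"speed-up: 2^{full_space_time - red_space_time}")
```

The same helper reproduces the balanced point $M = T = |U|^{2/3}$: for a space of $2^{48}$ points and memory $2^{32}$ it returns 32.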

What is left is to design a random function f over R whose output-permuted
variants are easy to evaluate, and for which the inversion of any variant yields
the desired key. Each state has a “full name” of 64 bits which describes the
contents of its three registers. However, our efficient sampling technique enables
us to give each red state a “short name” of 48 bits (which consists of the partial
contents of the registers and the random choices made during the sampling
process), and to quickly translate short names to full names. In addition, red
states are characterized (almost uniquely) by their “output names” defined as
the 48 bits which occur after $\alpha$ in their output sequences. We can now define
the desired function f over 48-bit strings as the mapping from short names to
output names of red states: Given a 48-bit short name x, we expand it to the
full name of a red state, clock this state 64 times, delete the initial 16-bit $\alpha$,
and define $f(x)$ as the remaining 48 output bits. The computation of $f(x)$ from
x can be efficiently done by using the previously described precomputed tables,
but the computation of x from $f(x)$ is exactly the problem of computing the
(short) name of an unknown red state from the 48 output bits it produces after
$\alpha$. When we consider some output-permuted variant $f_i$ of f, we obviously have
to apply the same permutation to the given output sequence before we try to
invert $f_i$ over it.

The recommended preprocessing stage stores $2^{12}$ tables on the hard disk.
Each table is defined by iterating one of the variants $f_i$ $2^{12}$ times on $2^{24}$ randomly
chosen 48-bit strings. Each table contains $2^{24}$ (start point, end point) pairs, but
implicitly covers about $2^{36}$ intermediate states. The collection of all the $2^{12}$ tables
requires $2^{36}$ disk space, but implicitly covers about $2^{48}$ red states.

The simplest implementation of the actual attack iterates each one of the
$2^{12}$ variants of f separately $2^{12}$ times on appropriately permuted versions of the
single red state we expect to find in the 2 seconds of data. After each step we
have to check whether the result is recorded as an end point in the corresponding
table, and thus we need $T = 2^{24}$ probes to random disk locations. At 6 ms
per probe, this requires more than a day. However, we can again use Rivest’s
idea of special points: We say that a red state is bright if the first 28 bits of
its output sequence contain the 16-bit $\alpha$ extended by 12 additional zero bits.
During preprocessing, we pick a random red start point, and use $f_i$ to quickly
jump from one red state to another. After approximately $2^{12}$ jumps, we expect
to encounter another bright red state, at which we stop and store the pair of
(start point, end point) in the hard disk. In fact, each end point consists of a
28 bit fixed prefix followed by 36 additional bits. As explained in the previous
attack, we do not have to store either the prefix (which is predictable) or the
suffix (which is used as an index) on the hard disk, and thus we need only half
the expected storage. We can further reduce the required storage by using the
fact that the bright red states have even shorter short names than red states (36
instead of 48 bits), and thus we can save 25% of the space by using bright red
instead of red start points in the table (see footnote 3). During the actual attack,
we find the first red state in the data, iterate each one of the $2^{12}$ variants of f over
it until we encounter a bright red state, and only then search this state among the
pairs stored in the disk. We thus have to probe the disk only once in each one of the
$t = 2^{12}$ tables, and the total probing time is reduced to 24 seconds.
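
The two probing times quoted above follow from the 6 ms access time. The sketch below simply multiplies the number of random disk probes by that latency; the constant names are illustrative.

```python
PROBE_SECONDS = 0.006            # 6 ms per random disk probe, as quoted in the text

naive_probes = 2**24             # one probe after every step of every variant
bright_probes = 2**12            # one probe per table when stopping at bright states

naive_hours = naive_probes * PROBE_SECONDS / 3600
bright_seconds = bright_probes * PROBE_SECONDS

print(f"probing after every step: {naive_hours:.1f} hours")
print(f"probing only at bright red states: {bright_seconds:.1f} seconds")
```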

There are many additional improvement ideas and implementation details
which will be described in the final version of this paper.

3 Note that we do not know how to jump in a direct way from one bright red state
to another, since we do not know how to sample them in an efficient way. We have
to try about $2^{12}$ red states in order to find one bright red start point, but the total
time needed to find the $2^{36}$ bright red start points in all the tables is less than the
$2^{48}$ complexity of the path evaluations during the preprocessing stage.

Fig. 5. Weight distributions. The graph on the left shows weight distribution for the
majority function; the graph on the right compares the weight distributions of several
clock-control functions.

Abstract. The alleged RC4 keystream generator is examined, and a
method of explicitly computing digraph probabilities is given. Using this
method, we demonstrate a method for distinguishing 8-bit RC4 from
randomness. Our method requires less keystream output than currently
published attacks, requiring only $2^{30.6}$ bytes of output. In addition, we
observe that an attacker can, on occasion, determine portions of the
internal state with nontrivial probability. However, we are currently unable
to extend this observation to a full attack.



1 Introduction 

We show an algorithm for deriving the exact probability of a digraph in the 
output of the alleged RC4 stream cipher. This algorithm has a running time 
of approximately 2 , where n is the number of bits in a single output. Using
the computed probabilities of each digraph for the case that n = 5, we discern
which digraphs have probabilities furthest from the value expected from a uniform
random distribution of digraphs. Extrapolating this knowledge, we show
how to distinguish the output of the alleged RC4 cipher with n = 8 from
randomness with $2^{30.6}$ outputs. This result improves on the best known method
of distinguishing that cipher from a truly random source. In addition, heuristic
arguments about the cause of the observed anomalies in the digraph distribution
are offered.

The irregularities in the digraph distribution that we observed allow the
recovery of the n and i parameters (defined in Section 2) if the attacker happens not
to know them. Also, an attacker can use this information in a ciphertext-only
attack, to reduce the uncertainty in a highly redundant unknown plaintext.

We also observe how an attacker can learn, with nontrivial probability, the 
value of some internal variables at certain points by observing large portions of 
the keystream. We are unable to derive the entire state from this observation, 
though with more study, this insight might lead to an exploitable weakness in 
the cipher. 

This paper is structured as follows. In Section 2 the alleged RC4 cipher is
described, and previous analysis and results are summarized. Section 3 presents
our analysis of that cipher, and Section 4 investigates the mechanisms behind the
statistical anomalies that we observe in that cipher. Section 5 examines fortuitous
states, which allow the attacker to deduce parts of the internal state. In Section
6 extensions of our analysis and directions for future work are discussed. Section
7 summarizes our conclusions. Lastly, the Appendix summarizes the results from
information theory that are needed to put a strong bound on the effectiveness of
tests based on the statistical anomalies, and presents those bounds for our work
and for previous work.

2 Description of the Alleged RC4 Cipher and Other Work

The alleged RC4 keystream generator is an algorithm for generating an arbitrarily
long pseudorandom sequence based on a variable length key. The pseudorandom
sequence is conjectured to be cryptographically secure for use in a stream
cipher. The algorithm is parameterized by the number of bits n within a
permutation element, which is also the number of bits that are output by a single
iteration of the next state function of the cipher. The value of n = 8 is of greatest
interest, as this is the value used by all known RC4 applications.

The RC4 keystream generator was created by RSA Data Security, Inc. [5].
An anonymous source claimed to have reverse-engineered this algorithm, and
published an alleged specification of it in 1994 [6]. Although public confirmation
of the validity of this specification is still lacking, we abbreviate the name ‘alleged
RC4’ to ‘RC4’ in the remainder of this paper. We also denote n-bit RC4 as
RC4/n.

A summary of the RC4 operations is given in Table 1. Note that in this table,
and throughout this paper, all additions and increments are done modulo $2^n$.

Table 1. The RC4 next state function. i and j are elements of $\mathbb{Z}/2^n$, and S is a
permutation of integers between zero and $2^n - 1$. All increments and sums are modulo
$2^n$.

1. Increment i by 1
2. Increment j by S[i]
3. Swap S[i] and S[j]
4. Output S[S[i] + S[j]]
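
The four steps of Table 1 translate directly into code. The sketch below implements only the next state function for a general word size n; the key setup is not shown because it is not described in this section, so the starting permutation used in the example (the identity, with i = j = 0) is purely an assumption for demonstration. The name `rc4_step` is illustrative.

```python
def rc4_step(S, i, j, n):
    """Apply one step of the RC4/n next state function.

    S is a list holding a permutation of 0 .. 2^n - 1 and is modified in place.
    Returns the updated (i, j) and the n-bit output of this step.
    """
    mod = 1 << n
    i = (i + 1) % mod                 # 1. increment i by 1
    j = (j + S[i]) % mod              # 2. increment j by S[i]
    S[i], S[j] = S[j], S[i]           # 3. swap S[i] and S[j]
    out = S[(S[i] + S[j]) % mod]      # 4. output S[S[i] + S[j]]
    return i, j, out


# Demonstration on RC4/3 with an assumed starting state.
n = 3
S = list(range(1 << n))
i, j = 0, 0
outputs = []
for _ in range(2):
    i, j, out = rc4_step(S, i, j, n)
    outputs.append(out)
print(outputs)
```

Each pair of successive outputs produced this way is one digraph in the sense used in the analysis that follows.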



2.1 Previous Analysis of RC4 

The best previously known result for distinguishing the output of RC4 from
that of a truly random source was found by Golić [2], who presents a statistical
defect that he estimates will allow an attacker to distinguish RC4/8 from
randomness with approximately $2^{40}$ successive outputs. However, this result
appears to be somewhat optimistic. We use the information theoretic lower bound
on the number of bytes needed to distinguish RC4 from randomness, for a given
statistical anomaly, and use this to measure the effectiveness of Golić’s anomaly
and our own anomalies (see the Appendix). The number of bytes of RC4/8
output needed to reduce the false positive and false negative rates to 10% is $2^{44.7}$,
using Golić’s anomaly, while the irregularities in the digraph distribution that
we found require just $2^{30.6}$ bytes to achieve the same result.

Mister and Tavares analyzed the cycle structure of RC4 [4]. They observe
that the state of the permutation can be recovered, given a significant fraction
of the full keystream. In addition, they also present a backtracking algorithm
that can recover the permutation from a short keystream output. Their analyses
are supported by experimental results on RC4/n for n < 6, and show that an
RC4/5 secret key can be recovered after only $2^{42}$ steps, though the nominal key
size is about 160 bits.

Knudsen et al. presented attacks on weakened versions of RC4 [3]. The weakened
RC4 variants that they studied change their internal state less often than
does RC4, though they change it in a similar way. Their basic attack backtracks
through the internal state, guessing values of table entries that have not yet been
observed, and backtracking upon contradictions. They present several variants
of their attack, and analyze its runtime. They estimate that the complexity of
their attack is less than the square root of the number of possible RC4 states.

3 Analysis of Digraph Probabilities 

The probability with which each digraph (that is, each successive pair of n-bit 
outputs) will appear in the output of RC4 is directly computable, given some 
reasonable assumptions. The probability of each digraph for each value of the i 
index is also computable. By taking advantage of the information on i , rather 
than averaging over all values of i and allowing some of the detail about the 
statistical anomalies to wash away, it is possible to more effectively distinguish 
RC4 from randomness. 

To simplify analysis, we idealize the key setup phase. We assume that the
key setup will generate each possible permutation with equal probability, and
will assign all possible values to j with equal probability. Then, after any fixed
N steps, all states of j and the permutation will still have equal probability,
because the next state function is invertible. This is an idealization; the actual
RC4 key setup will initialize j to zero. Also, the RC4 key setup routine generates
only $2^{n_k}$ different permutations, where $n_k$ is the number of bits in the key, while
there are $2^n!$ possible permutations. Intuitively, our idealization becomes a close
approximation of the internal state after RC4 runs for a short period of time.

However, we leave in the assumption that the i pointer is initially zero after 
the key setup phase. Note that, since each step changes i in a predictable manner, 
the attacker can assume knowledge of the i pointer for each output. 

We compute the exact digraph probabilities, under the assumptions given
above, by counting the number of internal states consistent with each digraph.
This approach works with RC4 because only a limited amount of the unknown
internal state actually affects each output, though the total amount of internal
state is quite large.

Starting at step 4 of Table 1, we look at what controls the two successive
outputs. The exhaustive list of everything on which those two outputs depend
is given in Table 2.



Table 2. The variables that control two successive outputs of RC4 and the
cryptanalyst’s knowledge of them.

Variable                                                          Cryptanalyst’s knowledge
i                                                                 known (increments regularly)
j                                                                 unknown
S[i]                                                              unknown
S[j]                                                              unknown
S[S[i] + S[j]]                                                    known (first output)
S[i + 1]                                                          unknown
S[j + S[i + 1]]                                                   unknown
S[j + S[i + 1]]   if i + 1 = S[i + 1] + S[j + S[i + 1]]           known (second output)
S[i + 1]          if j + S[i + 1] = S[i + 1] + S[j + S[i + 1]]
S[S[i + 1] + S[j + S[i + 1]]]   otherwise



As the next-state algorithm progresses, for each successive unknown value,
any value that is consistent with the previously seen states is equally probable.
Thus the probability of a digraph (a, b) for a particular value of i can be found
by stepping through all possible values of all other variables, and counting the
number of times that each output is consistent with the fixed values of i, a,
and b. The consistency of a set of values is determined by the fact that S is
a permutation. Because the start states were considered equally probable, this
immediately gives us the exact value of the probability of i, (a, b). This approach
requires about $2^{5n}$ operations to compute the probability of a single digraph, for
a given value of i, as there are five n-bit unknowns in Table 2. Approximately $2^{8n}$
operations are required to compute the probabilities of a digraph for all values
of i. This puts the most interesting case of n = 8 out of immediate reach, with a
computational cost of $2^{64}$ operations. However, we circumvented this difficulty
by computing the exact n = 3, 4, 5 digraph distributions for all i, observing
which digraphs have anomalous probabilities, and estimating the probabilities
of the anomalous digraphs for RC4/8. This method is described in the next two
subsections.
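
For very small n the digraph distribution can also be obtained by brute force, which is a useful cross-check on the counting argument. The sketch below is not the partial-state counting method described above: it enumerates every full state (permutation and j) under the idealized equal-probability assumption, so it is only feasible for tiny n. The function name and the convention that `i_before` is the value of i before the first of the two steps are assumptions of this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def digraph_distribution(n, i_before):
    """Exact digraph probabilities for RC4/n by enumerating all (S, j) states."""
    mod = 1 << n
    counts = Counter()
    total = 0
    for perm in permutations(range(mod)):
        for j_start in range(mod):
            S, i, j = list(perm), i_before, j_start
            outputs = []
            for _ in range(2):                    # two successive outputs
                i = (i + 1) % mod
                j = (j + S[i]) % mod
                S[i], S[j] = S[j], S[i]
                outputs.append(S[(S[i] + S[j]) % mod])
            counts[tuple(outputs)] += 1
            total += 1
    return {digraph: Fraction(c, total) for digraph, c in counts.items()}


dist = digraph_distribution(2, 0)                 # RC4/2: 4! * 4 = 96 states
uniform = Fraction(1, 16)
for digraph, p in sorted(dist.items()):
    print(digraph, p, "(uniform)" if p == uniform else "")
```

The cost grows as $2^n! \cdot 2^n$ states per value of i, which is why the text counts only the few state variables that actually influence two outputs.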

3.1 Anomalous RC4 Outputs 

The full digraph distributions for n = 3, 4, and 5 are computable with about $2^{40}$
operations. We computed these, and found that the distributions were significantly
different from a uniform distribution. In addition, there is a consistency
(across different values of n) to the irregularities in the digraph probabilities. In
particular, one type of digraph is more probable than expected by a factor of
approximately $1 + 2^{-n+1}$, seven types of digraphs are more probable than
expected by approximately $1 + 2^{-n}$, and three types of digraphs are less probable
than expected by approximately $1 - 2^{-n}$. These results are summarized in Table
3.



Table 3. Positive and negative events. Here, i is the value of the index when the first
symbol of the digraph is output. The top eight digraphs are in the set of positive
events, and the bottom three digraphs are in the set of negative events, as defined in
Section 3.1. The probabilities are approximate.

Digraph                              Value(s) of i                                    Probability
$(0, 0)$                             $i = 1$                                          $2^{-2n}(1 + 2^{-n+1})$
$(0, 0)$                             $i \ne 1,\ 2^n - 1$                              $2^{-2n}(1 + 2^{-n})$
$(0, 1)$                             $i \ne 0,\ 1$                                    $2^{-2n}(1 + 2^{-n})$
$(i + 1,\ 2^n - 1)$                  $i \ne 2^n - 2$                                  $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ i + 1)$                  $i \ne 1,\ 2^n - 2$                              $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ i + 2)$                  $i \ne 0,\ 2^n - 1,\ 2^n - 2,\ 2^n - 3$          $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ 0)$                      $i = 2^n - 2$                                    $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ 1)$                      $i = 2^n - 1$                                    $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ 2)$                      $i = 0,\ 1$                                      $2^{-2n}(1 + 2^{-n})$
$(2^{n-1} + 1,\ 2^{n-1} + 1)$        $i = 2$                                          $2^{-2n}(1 + 2^{-n})$
$(2^n - 1,\ 2^n - 1)$                $i \ne 2^n - 2$                                  $2^{-2n}(1 - 2^{-n})$
$(0,\ i + 1)$                        $i \ne 0,\ 2^n - 1$                              $2^{-2n}(1 - 2^{-n})$



We call the event that a digraph appears in the RC4 output at a given value
of i a positive event when it is significantly more probable than expected. A
negative event is similarly defined to be the appearance of a digraph at a given
i that is significantly less probable than expected. An exhaustive list of positive
and negative events is provided in Table 3.

In Section 4 we examine these particular digraphs to see why they are more
or less likely than expected. Most of the positive events correspond to length
2 fortuitous states, which will be defined in Section 5. For the (0,1) and (0,0)
positive events, and the negative events, a more complicated mechanism occurs,
which is discussed in the next section.

3.2 Extrapolating to Higher Values of n 

To apply our attack to higher values of n without directly computing the digraph
probabilities, we computed the probabilities of positive events and negative
events by running RC4/8 with several randomly selected keys and counting
the occurrences of those events in the RC4 output. The observed probabilities
(derived using RC4/8 with 10 starting keys for a length of $2^{38}$ for each key),
along with the computed expected probability from a truly random sequence,
are given in Table 4. It is possible to distinguish between these two probability
distributions with a 10% false positive and false negative rate by observing $2^{30.6}$
successive outputs using the data in Table 4 (see the Appendix).



Table 4. Comparison of event probabilities between RC4/8 and a random keystream.
The listed probabilities are the probability that two successive outputs are the specified
event.

          Positive Events   Negative Events
RC4/8     0.00007630        0.00003022
Random    0.00007600        0.00003034



In order to evaluate the effectiveness of our ‘extrapolation’ approach, we
compute the amount of keystream needed using a test based on our observed
positive and negative events, and compare that to the best possible test using
the exact probabilities. For RC4/5 our selected positive/negative events require
$2^{18.6}$ keystream outputs, while the optimal test using the exact probabilities of
all digraphs requires $2^{18.62}$ keystream outputs. These numbers agree to within a
small factor, suggesting that the extrapolation approach is close to optimal.
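The "Random" row of Table 4 can be cross-checked directly from Table 3. The sketch below is an illustration rather than the authors' code: it enumerates the (i, digraph) pairs listed in Table 3 for n = 8 and, assuming that i is uniformly distributed and that a truly random keystream emits every digraph with probability $2^{-2n}$, computes the chance that two successive outputs form a positive or a negative event. The function names are hypothetical.

```python
def table3_events(n=8):
    """Enumerate (i, first, second) triples for the events of Table 3.

    Returns (positive, negative) as two sets. Here m1 stands for 2^n - 1,
    and i is the index value when the first symbol is output.
    """
    size = 1 << n
    m1 = size - 1
    pos, neg = set(), set()
    for i in range(size):
        nxt, nxt2 = (i + 1) % size, (i + 2) % size
        if i != m1:                       # (0,0): i = 1 and i != 1, 2^n - 1
            pos.add((i, 0, 0))
        if i not in (0, 1):               # (0,1)
            pos.add((i, 0, 1))
        if i != size - 2:                 # (i+1, 2^n - 1)
            pos.add((i, nxt, m1))
        if i not in (1, size - 2):        # (2^n - 1, i+1)
            pos.add((i, m1, nxt))
        if i not in (0, m1, size - 2, size - 3):   # (2^n - 1, i+2)
            pos.add((i, m1, nxt2))
        if i == size - 2:                 # (2^n - 1, 0)
            pos.add((i, m1, 0))
        if i == m1:                       # (2^n - 1, 1)
            pos.add((i, m1, 1))
        if i in (0, 1):                   # (2^n - 1, 2)
            pos.add((i, m1, 2))
        if i == 2:                        # (2^(n-1) + 1, 2^(n-1) + 1)
            pos.add((i, size // 2 + 1, size // 2 + 1))
        if i != size - 2:                 # negative: (2^n - 1, 2^n - 1)
            neg.add((i, m1, m1))
        if i not in (0, m1):              # negative: (0, i+1)
            neg.add((i, 0, nxt))
    return pos, neg


def random_event_probability(events, n=8):
    """Probability that a uniformly random digraph at a uniformly random i is in `events`."""
    size = 1 << n
    return len(events) / size / (size * size)


pos, neg = table3_events(8)
print(len(pos), random_event_probability(pos))
print(len(neg), random_event_probability(neg))
```

For n = 8 the two probabilities round to the Random row of Table 4, which suggests that the table averages over all values of i.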

4 Understanding the Statistical Anomalies 

In this section we analyze the next state function of RC4 and show mechanisms
that cause the increased (or decreased) likelihood of some of the anomalous
digraphs. The figures below show the internal state of RC4 immediately before
step 4 in Table 1 of the first output of the digraph. The bottom line shows
the state of the permutation. Those permutation elements with a specific value
are labeled with that value. Elements that are of unspecified value are labeled
with the ‘wildcard’ symbol *. Ellipses indicate unspecified numbers of unspecified
elements, and elements separated by ellipses may actually be in the opposite order
within the permutation. The elements pointed to by i and j are indicated by the
i and j symbols appearing above them.

The mechanism that leads to the (0, 1) digraph starts in the state 

i j 

*,..., *, 1, 0, *, . . ., *, AA, *, . . . where AA = i. 

Following through the steps in the next-state function, the first output will
be 0, and at the following step 4, the cipher will be in the state

i j 

*,..., *, 1, AA, 0, *, ... 

and output a 1. This mechanism occurs approximately $2^{-3n}$ of the time, and
since other mechanisms output a (0, 1) $2^{-2n}$ of the time, this accounts for the
observed increase over expected.

For the (0, 0) positive events, the additional mechanism starts with the
following state:

i j 

*, AA, 0, *, . . ., *, BB, *, . . . where AA = i + 1 — j and BB = j. 
The negative events, on the other hand, correspond to mechanisms that normally
contribute to the expected output but that do not work in those particular cases.
For example, a normal method of producing a repeated digraph (AA, AA) is¹

i j 

*,..., *, BB, -1, *, AA, *, . . . 

Here the value AA occurs at location BB - 1. This outputs an AA, and steps
into the state: 

j i 

*, -1, BB, *,..., *, AA, *, . . . 

This will output another AA, unless AA happens to be either BB or -1. In either
case, this will output a BB. Since normal (AA, AA) pairs rely on this to achieve a
near-expected rate, the lack of this mechanism for (-1, -1) removes a contribution
of approximately $2^{-3n}$ from the probability of that digraph, which accounts for
the relative reduction of approximately $2^{-n}$ that we observe.

These mechanisms do not depend on the value of n, and so can be expected
to operate in the n = 8 case. This supports our extrapolation approach, which
assumes that the positive and negative events still apply in that case.

5 Analysis of Fortuitous States 

There are RC4 states in which only N elements of the permutation S are involved
in the next N successive outputs. We call these states fortuitous states.² Since the
variable i sweeps through the array regularly, it will always index N different
array elements on N successive outputs (for $N < 2^n$). So, the necessary and
sufficient condition for a fortuitous state is that the elements pointed to by j
and pointed to by S[i] + S[j] must come from the set of N array elements indexed
by i.

An example of an N = 3 fortuitous state follows:

i              j
*,   255, 2,   1,   *, ...

1. advance i to 1
2. advance j to 2
3. swap S[1] and S[2]
4. output S[1] = 2

     i    j
*,   2,   255, 1,   *, ...

1. advance i to 2
2. advance j to 1
3. swap S[2] and S[1]
4. output S[1] = 255

     j    i
*,   255, 2,   1,   *, ...

1. advance i to 3
2. advance j to 2
3. swap S[3] and S[2]
4. output S[3] = 2

          j    i
*,   255, 1,   2,   *, ...

¹ The symbol -1 is used as shorthand for $2^n - 1$ here and throughout the paper.

² Observing such a state is fortuitous for a cryptanalyst.

If i = 0 at the first step, and assuming that all permutations and settings for j
are equally probable, then the above initial conditions will hold with probability
$1/(256 \cdot 256 \cdot 255 \cdot 254)$. When the initial conditions hold, the output sequence
will always be (2, 255, 2). If RC4 outputs all trigraphs with equal probability
(the results in our previous section imply that it doesn’t, but we will use that as
an approximation), the sequence (2, 255, 2) will occur at i = 0 with probability
$1/(256 \cdot 256 \cdot 256)$. This implies that, when the output is the sequence (2, 255, 2)
at i = 0, this scenario caused that output approximately 1/253 of the
time. In other words, if the attacker sees, at offset 0, the sequence (2, 255, 2), he
can guess that j was initially 3, and that S[1], S[2], S[3] were initially 255, 2 and 1,
and be right a nontrivial portion of the time.
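The example can be replayed mechanically. The following sketch is illustrative only: it uses the standard RC4 next-state function (advance i, advance j by S[i], swap, output S[S[i] + S[j]]), fixes S[1], S[2], S[3] = 255, 2, 1 with i = 0 and j = 3, and fills the remaining entries with an arbitrary permutation of the other values. The helper names and the seeding scheme are assumptions made for the illustration.

```python
import random


def rc4_next(S, i, j):
    """One RC4 next-state step on permutation S (mutated in place).

    Returns (output, i, j) after the step.
    """
    size = len(S)
    i = (i + 1) % size
    j = (j + S[i]) % size
    S[i], S[j] = S[j], S[i]
    return S[(S[i] + S[j]) % size], i, j


def fortuitous_example_state(seed=0):
    """Build a 256-entry permutation with S[1], S[2], S[3] = 255, 2, 1."""
    fixed = {1: 255, 2: 2, 3: 1}
    rest = [v for v in range(256) if v not in fixed.values()]
    random.Random(seed).shuffle(rest)      # the other 253 entries are arbitrary
    filler = iter(rest)
    return [fixed[pos] if pos in fixed else next(filler) for pos in range(256)]


def run_example(seed=0):
    """Run three steps from i = 0, j = 3 and return (outputs, S[1:4], i, j)."""
    S, i, j = fortuitous_example_state(seed), 0, 3
    outputs = []
    for _ in range(3):
        out, i, j = rc4_next(S, i, j)
        outputs.append(out)
    return outputs, S[1:4], i, j


print(run_example())
# Posterior weight of this scenario given the observed trigraph, as in the text:
print(256 / (255 * 254))
```

Whatever the filler permutation, the three outputs depend only on S[1], S[2], S[3] and j, which is exactly the property that makes the state fortuitous.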

The number of fortuitous states can be found using a state-counting algorithm
similar to that given above. The numbers of such states, for small N, are
given in Table 5. The table lists, for each N, the number of fortuitous states
that exist of that length, the logarithm (base 2) of the expected time between
occurrences of any fortuitous state of that length, and the expected number
of false hits within that length of keystream. By false hit, we mean those output
patterns that are identical to the pattern of a fortuitous state, but are not caused
by a fortuitous state. For example, an attacker can expect to see, in a keystream
of length $2^{35.2}$, one fortuitous state of length 4 and 250 output patterns that look
like fortuitous states.



Table 5. The number of fortuitous states for RC4/8, their expected occurrence rates,
and their expected false hit rates.

Length   Number    Lg(Expected)   Expected False Hits
2        516       22.9           255
3        290       31.8           253
4        6540      35.2           250
5        25,419    41.3           246
6        101,819   47.2           241



It is not immediately clear how an attacker can use this information. What
saves RC4 from an immediate break is that the state space is so huge that an
attacker who directly guesses 56 bits (which is approximately what you get with
a length 6 fortuitous pattern) still has so many bits unguessed that there is
no obvious way to proceed. However, it does appear to be a weakness that the
attacker can guess significant portions of internal state at times with nontrivial
probability.

It may be possible to improve the backtracking approaches to deriving RC4 
state EE! using fortuitous states. For example, a backtracking algorithm can 
be started at the keystream location immediately after a fortuitous state, using 
the values of the internal state that are suggested by that state. This approach 
extends slightly the attack using ‘special streams’ presented in Section 4.3 of 0. 

6 Directions for Future Work 

Some extensions of our current work are possible. One direct extension is to 
compute the exact digraph probabilities for the case that n = 8, and other 
cases for n > 5. Since RC4/n is actually a complex combinatorial object, it 
may happen that the results for these cases are significantly different than what 
might be expected. 

Another worthwhile direction is to investigate the statistics of trigraphs (that
is, three consecutive output symbols). The exact trigraph probabilities can
be computed using an algorithm similar to that outlined in Section 3. The
computational cost to compute the complete trigraph distribution, for all i, is $2^{11n}$.
We have computed this for RC4/4, and found that the length of outputs required
to distinguish that cipher from randomness using trigraphs is about one-seventh
that required when using digraphs. This result is encouraging, though it does
not guarantee that trigraph statistics will be equally effective for larger values
of n. It must be considered that with n = 4, there are only $2^4 = 16$ entries in
the table S, and that three consecutive output symbols typically use half of the
state in this cipher.

The computational cost of computing the complete trigraph distribution
motivates the consideration of lagged digraphs, that is, two symbols of fixed value
separated by some symbols of non-fixed value. We call the number of intervening
symbols of non-fixed value the lag. For example, adapting the notation used
above, (1,*,2) is a lag one digraph with initial value 1 and final value 2. Here
we use the ‘wildcard’ symbol * to indicate that the middle symbol can take on
any possible value. Lagged digraphs are far easier to compute than trigraphs,
because it is not necessary to individually count the states that are used only
to determine the middle symbols. In general, the computational effort to compute
the distribution of RC4/n lag L digraphs, for all i, requires about $2^{(8+L)n}$
operations.

Another approach to computing a digraph probability is to list the possible 
situations that can occur within the RC4 cipher when producing that digraph, 
generate the equations that must hold among the internal elements, and use 
algebraic means to enumerate the solutions to those equations. The number of 
solutions corresponds to the number of states that lead to that digraph. This 
approach could lead to a method to compute the exact digraph probability in a 
time independent of n. 
Another direction would be to eliminate some of the assumptions made in our
analysis. For example, the assumption that S and j are uniformly random is false, 
and it is especially wrong immediately after key setup. In particular, j is initially 
set to zero during the key setup. We venture that an analysis of fortuitous states 
that takes the key setup into consideration may lead to a method for deriving 
some information about the secret key. 

7 Conclusions 

We presented a method for computing exact digraph probabilities for RC4/n
under reasonable assumptions, used this method to compute the exact distributions
for small n, observed consistency in the digraph statistics across all values of n,
and presented a simple method to extrapolate our knowledge to higher values of
n. The minimum amount of RC4 output needed to distinguish that cipher from
randomness was derived using information theoretic bounds, and this method
was used to compare the effectiveness of our attack to those in the literature.
Our methods provide the best known way to distinguish RC4/8, requiring only
$2^{30.6}$ bytes of output.

While we cannot extend either attack to find the original key or the entire
internal state of the cipher, further research may be able to extend these
observations into an attack that is more efficient than exhaustive search.

Appendix: Information Theoretic Bounds on 
Distinguishing RC4 from Randomness 

Information theory provides a lower bound on the number of outputs that are
needed to distinguish RC4 output from a truly random sequence. We derive this
bound for the case of false positive and false negative rates of 10%, for
our results and for those in the literature.

Following [J], we define the discrimination $L(p, q)$ between two probability
distributions p and q as

$$L(p, q) = \sum_s p(s) \lg \frac{p(s)}{q(s)},$$

where the sum is over all of the states in the distributions. The discrimination is
the expected value of the log-likelihood ratio (with respect to the distribution p),
and can be used to provide bounds on the effectiveness of hypothesis testing. A
useful fact about discrimination is that in the case that l independent observations
are made from the same set of states, the total discrimination is equal to l times the
discrimination of a single observation.

We consider a test T that predicts (with some likelihood of success) whether
or not a particular input string of l symbols, each of which is in $\mathbb{Z}/2^n$, was
generated by n-bit RC4 or by a truly random process. If the input string was
generated by RC4, the test T returns a ‘yes’ with probability $1 - \beta$. If the
input string was generated by a truly random process, then T returns ‘no’ with
probability $1 - \alpha$. In other words, $\alpha$ is the false positive rate, and $\beta$ is the
false negative rate. These rates can be related to the discrimination between
the probability distribution $p_r$ generated by a truly random process and the
distribution $p_{RC4}$ generated by n-bit RC4 (with a randomly selected key), where
the distributions are over all possible input strings. The discrimination
is related to $\alpha$ and $\beta$ by the inequality

$$L(p_r, p_{RC4}) \ge \beta \lg \frac{\beta}{1 - \alpha} + (1 - \beta) \lg \frac{1 - \beta}{\alpha}. \qquad (1)$$

Equality can be met by using an information-theoretic optimal test, such as
a Neyman-Pearson test. We expect our cryptanalyst to use such a test, and
we regard Equation (1) as an equality, though the implementation of such tests
is outside the scope of this paper.

Applying this result to distinguish the RC4 digraph distribution $p$ from the uniform
random distribution $\phi$ gives

$$L(\phi, p) = l \sum_{d \in \mathcal{D}} 2^{-2n} \lg \frac{2^{-2n}}{p(d)} = \beta \lg \frac{\beta}{1 - \alpha} + (1 - \beta) \lg \frac{1 - \beta}{\alpha}, \qquad (2)$$

where $\mathcal{D}$ is the set of digraphs, and $p(d)$ is the probability of digraph d with
respect to the distribution p. Solving this equation for l, we get the number of
RC4 outputs needed to distinguish that cipher.

To distinguish RC4 from randomness in the case that we only know the
probabilities of the positive and negative events defined in Section 3.1, we consider
only the states N, P and Q, where N is the occurrence of a negative event, P
is the occurrence of a positive event, and Q is the occurrence of any digraph
that is neither a positive nor negative event. Then the discrimination is given
by Equation (2), where the sum is over these three states. Solving this equation
for the number l of outputs, with $\alpha = \beta = 0.1$ and the data from Table 4, gives
$2^{30.6}$.

The linear model of RC4 derived by Golić demonstrates a bias in RC4/8 with
correlation coefficient $3.05 \times 10^{-7}$. In other words, there is an event that occurs
after each symbol output with probability $0.5 + 1.52 \times 10^{-7}$ in a keystream generated
by RC4, and with probability 0.5 in a keystream generated by a truly random
source. Using Equation (1) with $\alpha = \beta = 0.1$, we find that at least $2^{44.7}$ bytes are
required.
Abstract. SSC2 is a fast software stream cipher designed for wireless
handsets with limited computational capabilities. It supports various private
key sizes from 4 bytes to 16 bytes. All operations in SSC2 are word-oriented;
no complex operations such as multiplication, division, and exponentiation
are involved. SSC2 has a very compact structure that makes
it easy to implement on 8-, 16-, and 32-bit processors. Theoretical analysis
demonstrates that the keystream sequences generated by SSC2 have
long period, large linear complexity, and good statistical distribution.



1 Introduction 

For several reasons, encryption algorithms have been constrained in cellular and 
personal communications. First, the lack of computing power in mobile stations 
limits the use of computationally intensive encryption algorithms such as public 
key cryptography. Second, due to the high bit error rate of wireless channels, 
encryption algorithms which produce error propagation deteriorate the quality
of data transmission, and hence are not well suited to applications where high
bit error rates are commonplace. Third, the shortage of bandwidth at uplink
channels (from mobile station to base station) makes encryption algorithms with
low encryption (or decryption) rates unacceptable, and random delays in encryption
or decryption algorithms are not desirable either. To handle these issues,
the European Group Special Mobile (GSM) adopted a hardware implemented 
stream cipher known as alleged A5 [13]. This stream cipher has two main vari- 
ants: the stronger A5/1 version and the weaker A5/2 version. Recent analysis 
by Biryukov and Shamir [16] has shown that the A5/1 version can be broken 
in less than one second on a single PC. Other than this weakness, the hardware 
implementation of the alleged A5 also incurs additional cost. In addition, the 
cost of modifying the encryption algorithm in every handset would be exorbitant 
when such a need is called for. For this reason, a software implemented stream 
cipher which is fast and secure would be preferable. 

To this end, we designed SSC2, a software-oriented stream cipher which is 
easy to implement on 8-, 16-, and 32-bit processors. SSC2 belongs to the stream 
cipher family of combination generators. It combines a filtered linear feedback 
shift register (LFSR) and a lagged-Fibonacci generator. All operations involved 
in SSC2 are word-oriented, where a word consists of 4 bytes. The word sequence 
generated by SSC2 is added modulo-2 to the words of data frames in the manner
of a Vernam cipher. SSC2 supports various private key sizes from 4 bytes to 16 
bytes. It has a key scheduling scheme that stretches a private key to a master 
key of 21 words. The master key is loaded as the initial states of the LFSR 
and the lagged-Fibonacci generator. To cope with the synchronization problem, 
SSC2 also supplies an efficient frame key generation scheme that generates an 
individual key for each data frame. Theoretical analysis indicates that the key- 
stream sequences generated by SSC2 have long period, large linear complexity, 
and good statistical distribution. 

2 Specification of SSC2 

The keystream generator of SSC2, as depicted in Figure 1, consists of a filter 
generator and a lagged-Fibonacci generator. In the filter generator, the LFSR 
is a word-oriented linear feedback shift register. The word-oriented LFSR has 4 
stages with each stage containing a word. It generates a new word and shifts out 
an old word at every clock. The nonlinear filter compresses the 4-word content
of the LFSR to a word. The lagged-Fibonacci generator has 17 stages and is 
also word-oriented. The word shifted out by the lagged-Fibonacci generator is 
left-rotated 16 bits and then added to another word selected from the 17 stages. 
The sum is XOR-ed with the word produced by the filter generator. 




Fig. 1. The keystream generator of SSC2 



2.1 The Word-Oriented Linear Feedback Shift Register 

For software implementation, there are two major problems for LFSR-based
keystream generators. First, the speed of a software implemented LFSR is much
slower than that of a hardware implemented one. To update the state of an LFSR,
a byte-oriented or word-oriented processor needs to spend many clock cycles
to perform the bit-shifting and bit-extraction operations. Second, LFSR-based
keystream generators usually produce one bit at every clock, which again makes 
the software implementation inefficient. To make the software implementation 
efficient, we designed a word-oriented LFSR in SSC2 by exploiting the fact that 
each word of a linear feedback shift register sequence can be represented as a 
linear transformation of the previous words of the sequence. 



Fig. 2. The LFSR with characteristic polynomial $p(x) = x(x^{127} + x^{63} + 1)$



The LFSR used in SSC2, as depicted in Figure 2, has the characteristic
polynomial

$$p(x) = x(x^{127} + x^{63} + 1),$$

where the factor $x^{127} + x^{63} + 1$ of $p(x)$ is a primitive polynomial over GF(2).
After discarding $s_0$, the LFSR sequence, $s_1, s_2, \ldots$, is periodic and has the least
period $2^{127} - 1$. The state $S_n = (s_{n+127}, s_{n+126}, \ldots, s_n)$ at time n can be divided
into 4 blocks with each block being a word, that is,

$$S_n = (x_{n+3}, x_{n+2}, x_{n+1}, x_n).$$

After running the LFSR 32 times, the LFSR has the state

$$S_{n+32} = (x_{n+4}, x_{n+3}, x_{n+2}, x_{n+1}).$$



It can be shown that

$$x_{n+4} = x_{n+2} \oplus (s_{n+32}, 0, 0, \ldots, 0) \oplus (0, s_{n+31}, s_{n+30}, \ldots, s_{n+1}). \qquad (1)$$

Let $\ll$ denote the zero-fill left-shift operation. By $x \ll j$, we mean that the word
x is shifted left j bits, with a zero filled into the right-most bit each time
x is shifted left 1 bit. Similarly, let $\gg$ denote the zero-fill right-shift operation.
With these notations, we can rewrite equation (1) as follows:

$$x_{n+4} = x_{n+2} \oplus (x_{n+1} \ll 31) \oplus (x_n \gg 1), \qquad (2)$$

which describes the operation of the word-oriented LFSR in Figure 1. It is in- 
teresting to note that the feedback connections of the word-oriented LFSR are 
not sparse even though the bit-oriented LFSR described by p(x) has very sparse 
feedback connections. 
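The equivalence between the bit-oriented and the word-oriented descriptions can be checked with a short sketch. It assumes the bit recurrence $s_{k+128} = s_{k+64} \oplus s_{k+1}$ implied by $p(x) = x(x^{127} + x^{63} + 1)$, and it packs 32 consecutive bits into a word with the earliest bit as the least significant bit. That packing convention and the function names are assumptions made for this illustration.

```python
import random

MASK32 = 0xFFFFFFFF


def word_lfsr_step(x):
    """Advance the word-oriented LFSR one clock using equation (2).

    x is the window [x_n, x_{n+1}, x_{n+2}, x_{n+3}]; the result is
    [x_{n+1}, x_{n+2}, x_{n+3}, x_{n+4}].
    """
    x_new = x[2] ^ ((x[1] << 31) & MASK32) ^ (x[0] >> 1)
    return [x[1], x[2], x[3], x_new]


def bit_lfsr(seed_bits, count):
    """Bit sequence with s[k+128] = s[k+64] ^ s[k+1], started from 128 seed bits."""
    s = list(seed_bits)
    while len(s) < count:
        k = len(s) - 128
        s.append(s[k + 64] ^ s[k + 1])
    return s


def pack_word(s, m):
    """Word m holds bits s[32m] .. s[32m+31], with s[32m] as the least significant bit."""
    return sum(s[32 * m + t] << t for t in range(32))


def check_word_recurrence(num_words=40, seed=1):
    """Compare the word recurrence against words packed from the bit sequence."""
    rng = random.Random(seed)
    s = bit_lfsr([rng.randrange(2) for _ in range(128)], 32 * num_words)
    words = [pack_word(s, m) for m in range(num_words)]
    window = words[:4]
    for m in range(4, num_words):
        window = word_lfsr_step(window)
        if window[3] != words[m]:
            return False
    return True


print(check_word_recurrence())
```

One word update costs two shifts and two XORs, compared with 32 single-bit clockings of the bit-oriented register.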

In the bit-oriented LFSR described by Figure 2, the stage 0 is not involved
in the computation of the feedback and hence is redundant. It is left there just
to make the length of the LFSR a multiple of 32. For this reason, the
content of stage 0 will be excluded from the state of the LFSR. Let $S'_n$ denote
the bit-oriented LFSR state consisting of the contents of stage 1 through stage
127, namely,

$$S'_n = (s_{n+127}, s_{n+126}, \ldots, s_{n+1}).$$

Correspondingly, let $S''_n$ denote the state of the word-oriented LFSR, where $S''_n$
is made up of the contents of stage 1 through stage 127 of the word-oriented
LFSR at time n. Thus,

$$S''_n = S'_{32n}, \quad n \ge 0. \qquad (3)$$



Proposition 1. Assume that the initial state $S''_0$ of the word-oriented LFSR
described by (2) is not zero. Then the state sequence $S''_0, S''_1, \ldots$ is periodic and
has the least period $2^{127} - 1$. Furthermore, for any $0 \le i < j < 2^{127} - 1$, $S''_i \ne S''_j$.

Proof. Since $x^{127} + x^{63} + 1$ is a primitive polynomial over GF(2), and $S'_0 = S''_0 \ne 0$,
the state sequence $S'_0, S'_1, \ldots$ of the bit-oriented LFSR is periodic and has the
least period $2^{127} - 1$. Thus, by (3), $S''_{n + 2^{127} - 1} = S'_{32(n + 2^{127} - 1)} = S''_n$. Hence, the
sequence $S''_0, S''_1, \ldots$ is periodic and has a period of $2^{127} - 1$. The least period
of the sequence should be a divisor of $2^{127} - 1$. On the other hand, $2^{127} - 1$ is
a prime number (divisible only by 1 and itself). So the least period of $S''_0, S''_1, \ldots$ is
$2^{127} - 1$.

Next, assume that $S''_i = S''_j$ for i and j with $0 \le i < j < 2^{127} - 1$. Then
$S'_{32i} = S'_{32j}$, which implies that $32(j - i)$ is a multiple of $2^{127} - 1$. Since $2^{127} - 1$
is relatively prime to 32, $j - i$ is a multiple of $2^{127} - 1$, which contradicts the assumption.



2.2 The Nonlinear Filter 

The nonlinear filter is a memoryless function, that is, its output at time n only
depends on the content of the word-oriented LFSR at time n. Let $(x_{n+3}, x_{n+2},
x_{n+1}, x_n)$ denote the content of the word-oriented LFSR at time n. The output
at time n, denoted by $z'_n$, is described by the following pseudo-code:

Nonlinear-Function F(x_{n+3}, x_{n+2}, x_{n+1}, x_n)

1  A ← x_{n+3} + (x_n ∨ 1) mod 2^32
2  c ← carry
3  cyclic shift A left 16 bits
4  if (c = 0) then
5      A ← A + x_{n+2} mod 2^32
6  else
7      A ← A + (x_{n+2} ⊕ (x_n ∨ 1)) mod 2^32
8  c ← carry
9  return A + (x_{n+1} ⊕ x_{n+2}) + c mod 2^32
Let $c_1$ and $c_2$ denote the first (line 2) and second (line 8) carry bits in the
pseudo-code. For a 32-bit integer A, let $(A)_{16}$ denote the result of cyclicly shifting
A left 16 bits. Then the function F has the following compact form:

$$z'_n = (x_{n+3} + (x_n \vee 1))_{16} + x_{n+2} \oplus c_1 (x_n \vee 1) + x_{n+1} \oplus x_{n+2} + c_2 \bmod 2^{32}, \qquad (4)$$

where $\vee$ denotes the bitwise “OR” operation. The priority of $\oplus$ is assumed to be
higher than that of +. Note that the least significant bit of $x_n$ is always masked
by 1 in order to get rid of the effect of stage 0 of the LFSR in Figure 2.
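The pseudo-code for F translates almost line for line into the following sketch. It is an illustrative rendering rather than reference code: Python integers are unbounded, so each carry is read off as bit 32 of the unreduced sum, and the names `ssc2_filter` and `rotl16` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rotl16(a):
    """Cyclic left shift of a 32-bit word by 16 bits."""
    return ((a << 16) | (a >> 16)) & MASK32


def ssc2_filter(x3, x2, x1, x0):
    """Nonlinear filter F applied to the LFSR words (x_{n+3}, x_{n+2}, x_{n+1}, x_n)."""
    total = x3 + (x0 | 1)              # line 1: add with the low bit of x_n forced to 1
    c1 = total >> 32                   # line 2: first carry
    a = rotl16(total & MASK32)         # line 3: cyclic shift left 16 bits
    if c1 == 0:                        # lines 4-7: the addend depends on the first carry
        total = a + x2
    else:
        total = a + (x2 ^ (x0 | 1))
    c2 = total >> 32                   # line 8: second carry
    return ((total & MASK32) + (x1 ^ x2) + c2) & MASK32   # line 9


print(hex(ssc2_filter(0, 0, 0, 0)))
print(hex(ssc2_filter(0xFFFFFFFF, 0, 0, 0)))
```

Feeding the carry of one addition into the choice of the next addend is what makes the function nonlinear over GF(2) while using only additions, XORs, and one rotation.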



2.3 The Lagged-Fibonacci Generator 

Lagged-Fibonacci generators, also called additive generators, have been widely
used as random number generators in Monte Carlo simulation [4,6].
Mathematically, a lagged-Fibonacci generator can be characterized by the following
recursion:

$$y_n = y_{n-s} + y_{n-r} \bmod M, \quad n \ge r. \qquad (5)$$

The generator is defined by the modulus M, the register length r, and the lag s,
where r > s. When M is prime, periods as large as $M^r - 1$ can be achieved for
the generated sequences. However, it is more common to use lagged-Fibonacci
generators with $M = 2^m$, m > 1. These generators with power-of-two moduli are
much easier to implement than those with prime moduli. The following lemma
was proved by Brent [1].

Lemma 1. Assume that $M = 2^m$, m > 1, r > 2, and the polynomial $x^r + x^s + 1$
is primitive over GF(2). Then the sequence $y_0, y_1, \ldots$ of the lagged-Fibonacci
generator described by (5) has the least period $2^{m-1}(2^r - 1)$ if $y_0, y_1, \ldots, y_{r-1}$
are not all even.

In SSC2, the lagged-Fibonacci generator with s = 5, r = 17, and $M = 2^{32}$
was adopted. We implemented this generator with a 17-stage circular buffer,
B, and two pointers, s and r. Initially B[17], B[16], ..., B[1] are loaded with
$y_0, y_1, \ldots, y_{16}$, and s and r are set to 5 and 17, respectively. At every clock, a
new word is produced by taking the sum of B[r] and B[s] mod $2^{32}$, the word B[r]
is then replaced by the new word, and the pointers s and r are decreased by
1. In this way, the buffer B produces the lagged-Fibonacci sequence. We use a
multiplexer to generate the output sequence $z''_n$, $n \ge 0$. The output word $z''_n$ is
computed from the replaced word $y_n$ and another word selected from the buffer
B. The selection is based on the most significant 4 bits of the newly produced
word $y_{n+17}$. The output word at time n, denoted by $z''_n$, is given by

$$z''_n = (y_n)_{16} + B[1 + ((y_{n+17} \gg 28) + s_{n+1} \bmod 16)] \bmod 2^{32}, \qquad (6)$$

where $s_{n+1}$ denotes the value of s at time n + 1. The pseudo-code for $z''_n$ is listed
as follows:
1  A ← B[r]
2  D ← B[s] + B[r] mod 2^32
3  B[r] ← D
4  r ← r - 1
5  s ← s - 1
6  if (r = 0) then r ← 17
7  if (s = 0) then s ← 17
8  cyclicly shift A left 16 bits
9  output A + B[1 + ((s + (D ≫ 28)) mod 16)] mod 2^32
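The listing above can be turned into a small runnable sketch. This is an illustration of the described circular-buffer scheme, not reference code: the class name is hypothetical, index 0 of the Python list is left unused so that the buffer indices match B[1] through B[17], and `step` also returns the newly produced word D so that it can be compared with recursion (5).

```python
MASK32 = 0xFFFFFFFF


class LaggedFibonacci:
    """Lagged-Fibonacci generator with s = 5, r = 17, M = 2^32, plus the output multiplexer."""

    def __init__(self, seed_words):
        if len(seed_words) != 17:
            raise ValueError("expected 17 seed words y_0 .. y_16")
        self.B = [0] * 18                      # B[1..17]; B[0] is unused
        for k, y in enumerate(seed_words):
            self.B[17 - k] = y & MASK32        # B[17], B[16], ..., B[1] = y_0, y_1, ..., y_16
        self.r, self.s = 17, 5

    def step(self):
        """Clock once; return (output word, newly produced word y_{n+17})."""
        a = self.B[self.r]                                  # the word about to be replaced, y_n
        d = (self.B[self.s] + self.B[self.r]) & MASK32      # y_{n+17} = y_{n+12} + y_n
        self.B[self.r] = d
        self.r -= 1
        self.s -= 1
        if self.r == 0:
            self.r = 17
        if self.s == 0:
            self.s = 17
        a = ((a << 16) | (a >> 16)) & MASK32                # cyclic shift left 16 bits
        index = 1 + ((self.s + (d >> 28)) % 16)             # top 4 bits of D select a stage
        return (a + self.B[index]) & MASK32, d


gen = LaggedFibonacci(list(range(1, 18)))
print([hex(gen.step()[0]) for _ in range(4)])
```

The selected index always lies between 1 and 16, so stage 17 is never chosen by the multiplexer in this reading of equation (6).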



Proposition 2. Assume that the initial state $y_0, y_1, \ldots, y_{16}$ of the lagged-Fibonacci
generator are not all even. Then the sequence $z'' = z''_0, z''_1, \ldots$ is periodic
and its least period is a divisor of $17(2^{17} - 1)2^{31}$.

Proof. Let $r_n$ and $s_n$ denote the values of r and s at time n. It is easy to verify
that

$$r_n = 17 - (n \bmod 17),$$

and

$$s_n = 17 - ((n + 12) \bmod 17).$$

Hence, the two sequences $r = r_0, r_1, \ldots$ and $s = s_0, s_1, \ldots$ are periodic and have
the period 17. Since $y_0, y_1, \ldots, y_{16}$ are not all even, by Lemma 1, the lagged-Fibonacci
sequence $y = y_0, y_1, \ldots$ is periodic and has the period $(2^{17} - 1)2^{31}$.
Let $T_y$ denote the period of y. For any $1 \le i \le 17$, at time $n = 17T_y - i$, the
pointer r has value $r_n = 17 - (n \bmod 17) = i$; thus, the content of B[i] is replaced
by $y_{n+17} = y_{17T_y - i + 17} = y_{17-i}$. Hence, at time $n = 17T_y - 17$, the word in B[17]
is replaced by $y_0$, at time $n = 17T_y - 16$, the word in B[16] is replaced by $y_1$,
..., at time $n = 17T_y - 1$, the word in B[1] is replaced by $y_{16}$. Therefore, at time
$17T_y$, the content of B is the same as its content at time n = 0. Similarly, it can
be proved that the content of B at time $n + 17T_y$ is the same as its content at
time n. By (6), $z''_n$ can be expressed by

$$z''_n = (y_n)_{16} + B[1 + ((y_{n+17} \gg 28) + s_{n+1} \bmod 16)].$$

Let $i_n$ denote the index in B in the above equation, namely,

$$i_n = 1 + ((y_{n+17} \gg 28) + s_{n+1} \bmod 16).$$

Then

$$i_{n+17T_y} = 1 + ((y_{n+17T_y+17} \gg 28) + s_{n+17T_y+1} \bmod 16) = i_n.$$

Consequently,

$$z''_{n+17T_y} = (y_{n+17T_y})_{16} + B[i_{n+17T_y}] \bmod 2^{32} = z''_n,$$

which implies that the period of $z''$ divides $17T_y$.
3 Cryptographic Properties of SSC2

Period, linear complexity, and statistical distribution are three fundamental
measures of security for keystream generators. Unfortunately, these measures are
difficult to analyze for most of the proposed keystream generators. In this section,
an assessment of the strength of SSC2 will be carried out with respect to these
measures.

In the following, we will use $z = z_0, z_1, \ldots$ to denote the keystream sequence
generated by SSC2. It is the sum (bitwise XOR) of the two sequences $z' = z'_0, z'_1, \ldots$
of the filter generator, and $z'' = z''_0, z''_1, \ldots$ of the lagged-Fibonacci generator.

Theorem 1. Assume that the initial state $S''_0$ of the word-oriented LFSR is not
zero, and the initial state $y_0, y_1, \ldots, y_{16}$ of the lagged-Fibonacci generator
are not all even. Then the least period of the keystream sequence generated by
SSC2 is greater than or equal to $2^{128} - 2$.

Proof. Let $T_z$, $T_{z'}$, and $T_{z''}$ denote the least periods of $z$, $z'$, and $z''$, respectively.
By Proposition 1 and Proposition 2, $T_{z'} = 2^{127} - 1$, and $T_{z''}$ is a factor of
$17(2^{17} - 1)2^{31}$. Since $2^{127} - 1$ and $17(2^{17} - 1)2^{31}$ are relatively prime, $T_{z'}$ and
$T_{z''}$ are also relatively prime. Hence $T_z = T_{z'} T_{z''}$. Therefore $T_z \ge 2T_{z'} = 2^{128} - 2$.

Let $\Lambda(z')$ and $\Lambda(z'')$ denote the linear complexity of $z'$ and $z''$. According to
[11], the linear complexity of $z = z' \oplus z''$ is bounded by

$$\Lambda(z') + \Lambda(z'') - 2\gcd(T_{z'}, T_{z''}) \le \Lambda(z) \le \Lambda(z') + \Lambda(z''). \qquad (7)$$

Thus, if we have lower bounds on the linear complexity of either $z'$ or $z''$, we
can achieve lower bounds on the linear complexity of z. In the following, we will
analyze the linear complexity of $z'$.

We can treat the sequence $z' = z'_0, z'_1, \ldots$ in three different forms. First, it is
a sequence of words with $z'_n = (z'_{31,n}, z'_{30,n}, \ldots, z'_{0,n})$; second, it is a sequence of
bits; and third, it can be considered as a collection of 32 component sequences,
$z'_i = z'_{i,0}, z'_{i,1}, \ldots$, $0 \le i \le 31$. For any $0 \le i \le 31$, $z'_{i,n}$ can be described by

$$z'_{i,n} = f_i(s''_{127,n}, s''_{126,n}, \ldots, s''_{1,n}), \quad n \ge 0, \qquad (8)$$

where $(s''_{127,n}, s''_{126,n}, \ldots, s''_{1,n})$ is the state of the word-oriented LFSR at time n,
and $f_i$ is the i-th component of the nonlinear filter F. Assume that the nonlinear
order, ord($f_i$), of $f_i$ is $\ell_i$. From Key’s analysis [5], the linear complexity of $z'_i$ is
bounded by

$$\Lambda(z'_i) \le L_{\ell_i} = \sum_{j=1}^{\ell_i} \binom{127}{j}. \qquad (9)$$

The upper bound $L_{\ell_i}$ is usually satisfied with equality, although there are a few
exceptions in which the actual linear complexity deviates from the expected value
given by the upper bound. Rueppel [12] proved that, for an LFSR with primitive
connection polynomial of prime degree $L$, the fraction of Boolean functions of
nonlinear order $\ell$ that produce sequences of linear complexity $L_\ell$ is

$$P_d \approx \exp\left(-\frac{L_\ell}{L \cdot 2^L}\right) > e^{-1/L}. \tag{10}$$

For the filter generator of SSC2, we have $P_d > e^{-1/127}$. Hence, the linear complexity
of $z'_i$ is virtually certain to be $L_{\ell_i}$. To determine the nonlinear order of $f_i$,
we will study the nonlinear order of integer addition.
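
To get a feel for the size of Key's bound, the sum of binomial coefficients can be evaluated numerically. The sketch below is an illustration only: the names `key_bound` and `two_to_the` are hypothetical, and floating point is used because the values for an LFSR of length 127 do not fit in 64-bit integers, so large results are approximate.

```c
/* Key's bound L_l = sum_{j=1}^{l} C(n, j) for an LFSR of length n.
 * Exact for small n; approximate (floating point) for n = 127. */
long double key_bound(unsigned n, unsigned l)
{
    long double c = 1.0L, sum = 0.0L;       /* c holds C(n, j) */
    for (unsigned j = 1; j <= l && j <= n; j++) {
        c = c * (long double)(n - j + 1) / (long double)j;
        sum += c;
    }
    return sum;
}

/* 2^e as a long double, by repeated doubling. */
long double two_to_the(unsigned e)
{
    long double p = 1.0L;
    while (e--)
        p *= 2.0L;
    return p;
}
```

With `n = 127` and `l = 64`, the bound minus 2 is already larger than `two_to_the(126)`, which is the order of magnitude the analysis of SSC2 is aiming for.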

Let $x = (x_{31}, x_{30}, \ldots, x_0)$ and $y = (y_{31}, y_{30}, \ldots, y_0)$ denote the binary
representation of two 32-bit integers, where $x_0$ and $y_0$ are the least significant bits of $x$
and $y$. The sum, $x + y \bmod 2^{32}$, defines a new 32-bit integer $z = (z_{31}, z_{30}, \ldots, z_0)$.
The binary digits $z_i$, $0 \le i \le 31$, are recursively computed by

$$z_i = x_i \oplus y_i \oplus c_{i-1}, \tag{11}$$

$$c_i = x_i y_i \oplus (x_i \oplus y_i) c_{i-1}, \tag{12}$$

where $c_{i-1}$ denotes the carry bit, and $c_{-1} = 0$. The 31st carry bit $c_{31}$ is also
called the carry bit of $x + y$. In (11) and (12), $z_i$ and $c_i$ are Boolean functions of
$x_i$ and $y_i$, $0 \le i \le 31$. In the following, we will use $\mathrm{ord}(z_i)$ and $\mathrm{ord}(c_i)$ to denote
the nonlinear order of the Boolean functions represented by $z_i$ and $c_i$.
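
The recursion (11) and (12) is the usual ripple-carry adder. A minimal sketch, with the hypothetical name `add_bitwise`, computes the sum bit by bit and records every carry bit so the result can be compared with native 32-bit addition:

```c
#include <stdint.h>

/* Bit-serial addition following the recursion
 *   z_i = x_i ^ y_i ^ c_{i-1},   c_i = x_i y_i ^ (x_i ^ y_i) c_{i-1},
 * with c_{-1} = 0. Bit i of *carries receives the carry bit c_i. */
uint32_t add_bitwise(uint32_t x, uint32_t y, uint32_t *carries)
{
    uint32_t z = 0, cs = 0, c = 0;           /* c holds c_{i-1} */
    for (int i = 0; i < 32; i++) {
        uint32_t xi = (x >> i) & 1u, yi = (y >> i) & 1u;
        z |= (xi ^ yi ^ c) << i;
        c = (xi & yi) ^ ((xi ^ yi) & c);
        cs |= c << i;
    }
    *carries = cs;
    return z;
}
```

Bit 31 of `*carries` is the carry bit of $x + y$ in the sense used above.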

Lemma 2. Assume that $x = (x_{31}, x_{30}, \ldots, x_0)$, $y = (y_{31}, y_{30}, \ldots, y_0)$ are two 32-bit
integers, and $x_0 = 1$. Let $z = (z_{31}, z_{30}, \ldots, z_0)$ denote the sum $x + y$, and
$c = (c_{31}, c_{30}, \ldots, c_0)$ denote the carry bits produced by the summation. Then
$\mathrm{ord}(c_i) = i + 1$, $0 \le i \le 31$. Furthermore, $\mathrm{ord}((x_i \oplus y_i) c_{31}) = 32$, $1 \le i \le 31$, and
$\mathrm{ord}(c_{15} c_{31}) = 33$.

Proof. By (12), $c_0 = y_0$, and $c_1 = x_1 y_1 \oplus (x_1 \oplus y_1) y_0$. So $\mathrm{ord}(c_0) = 1$ and
$\mathrm{ord}(c_1) = 2$. Assume that $\mathrm{ord}(c_i) = i + 1$, $i \ge 2$. Since $x_{i+1}$ and $y_{i+1}$ do not appear
in the Boolean function represented by $c_i$, $\mathrm{ord}(x_{i+1} y_{i+1}) < \mathrm{ord}((x_{i+1} \oplus y_{i+1}) c_i)$
for $i \ge 2$, so

$$\mathrm{ord}(c_{i+1}) = \mathrm{ord}(x_{i+1} y_{i+1} \oplus (x_{i+1} \oplus y_{i+1}) c_i)$$

$$= \mathrm{ord}((x_{i+1} \oplus y_{i+1}) c_i)$$

$$= i + 2.$$

By induction, $\mathrm{ord}(c_i) = i + 1$, $0 \le i \le 31$. Using similar techniques, it can be
proved that $\mathrm{ord}((x_i \oplus y_i) c_{31}) = 32$ and $\mathrm{ord}(c_{15} c_{31}) = 33$.

Lemma 3. Let $z = F(x_3, x_2, x_1, x_0)$ denote the nonlinear function described by
(4), where $z = (z_{31}, z_{30}, \ldots, z_0)$, and $x_i = (x_{i,31}, x_{i,30}, \ldots, x_{i,0})$, $0 \le i \le 3$. For
any $0 \le i \le 31$, let $z_i = f_i(x_3, x_2, x_1, x_0)$; then $\mathrm{ord}(f_i) \ge 64 + i$.

Proof. Recall that $F$ is a mapping of $GF(2)^{128}$ to $GF(2)^{32}$ given by

$$z = (x_3 + (x_0 \vee 1))_{\gg 16} + x_2 \oplus c_1 (x_0 \vee 1) + x_1 \oplus x_2 + c_2 \bmod 2^{32},$$

where $c_1$ and $c_2$ are the carry bits produced by the first two additions.
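
For readers who prefer code to notation, the mapping $F$ can be sketched as a C function. This follows the filter portion of the Appendix 1 listing, where R1, R2, R3, R4 play the roles of $x_3$, $x_2$, $x_1$, $x_0$; the function name `ssc2_filter`, the helper `rot16`, and the use of `uint32_t` are assumptions made here for a self-contained example.

```c
#include <stdint.h>

/* Rotate a 32-bit word by 16 bit positions (swaps the two halves). */
uint32_t rot16(uint32_t w)
{
    return (w << 16) | (w >> 16);
}

/* Sketch of the nonlinear filter F(x3, x2, x1, x0). */
uint32_t ssc2_filter(uint32_t x3, uint32_t x2, uint32_t x1, uint32_t x0)
{
    uint32_t a  = x0 | 1u;
    uint32_t s1 = x3 + a;                      /* first addition         */
    uint32_t c1 = s1 < a;                      /* its carry bit c1       */
    uint32_t t  = rot16(s1);
    uint32_t s2 = t + (c1 ? (x2 ^ a) : x2);    /* second addition        */
    uint32_t c2 = s2 < t;                      /* its carry bit c2       */
    return s2 + (x1 ^ x2) + c2;                /* third addition, mod 2^32 */
}
```

The carry bits are recovered with the comparison trick used in the appendix: an unsigned sum is smaller than one of its operands exactly when the addition overflowed.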
Let $Z_1 = (z_{1,31}, z_{1,30}, \ldots, z_{1,0})$ denote the sum of $x_3$ and $x_0 \vee 1$. The carry
bits produced by the summation are denoted by $c_{1,31}, c_{1,30}, \ldots, c_{1,0}$. It is clear
that $c_1 = c_{1,31}$. By Lemma 2, $\mathrm{ord}(c_1) = 32$, and $\mathrm{ord}(z_{1,i}) = \mathrm{ord}(c_{1,i-1}) = i$,
$1 \le i \le 31$.

Let $Z'_1 = (z'_{1,31}, z'_{1,30}, \ldots, z'_{1,0})$ denote $(Z_1)_{\gg 16}$, that is, $z'_{1,i} = z_{1,16+i}$, for
$0 \le i \le 15$, and $z'_{1,i} = z_{1,i-16}$, for $16 \le i \le 31$. Let $Z_2 = (z_{2,31}, z_{2,30}, \ldots, z_{2,0})$
denote the sum of $Z'_1$ and $x_2 \oplus c_1(x_0 \vee 1)$, and $c_{2,31}, c_{2,30}, \ldots, c_{2,0}$ denote the
carry bits produced by the summation. By (11) and (12),

$$z_{2,i} = z'_{1,i} \oplus x_{2,i} \oplus c_1 x_{0,i} \oplus c_{2,i-1}, \tag{13}$$

$$c_{2,i} = z'_{1,i}(x_{2,i} \oplus c_1 x_{0,i}) \oplus (z'_{1,i} \oplus x_{2,i} \oplus c_1 x_{0,i}) c_{2,i-1}, \tag{14}$$

where $c_{2,-1} = 0$ and $x_{0,0} = 1$. Rewriting (14), we have

$$c_{2,i} = (z'_{1,i} \oplus c_{2,i-1}) x_{2,i} \oplus z'_{1,i} c_1 x_{0,i} \oplus (z'_{1,i} \oplus c_1 x_{0,i}) c_{2,i-1}.$$

Since $x_{2,i}$ does not appear in $z'_{1,i} c_1 x_{0,i} \oplus (z'_{1,i} \oplus c_1 x_{0,i}) c_{2,i-1}$, we have

$$\mathrm{ord}(c_{2,i}) \ge \mathrm{ord}((z'_{1,i} \oplus c_{2,i-1}) x_{2,i}). \tag{15}$$

The carry bit $c_{2,0}$ has the following expression,

$$c_{2,0} = z'_{1,0}(x_{2,0} \oplus c_1 x_{0,0})$$

$$= z_{1,16}(x_{2,0} \oplus c_{1,31})$$

$$= (x_{3,16} \oplus x_{0,16} \oplus c_{1,15}) x_{2,0} \oplus (x_{3,16} \oplus x_{0,16}) c_{1,31} \oplus c_{1,15} c_{1,31}.$$

By Lemma 2, $\mathrm{ord}((x_{3,16} \oplus x_{0,16}) c_{1,31}) = 32$, and $\mathrm{ord}(c_{1,15} c_{1,31}) = 33$. The order
of $(x_{3,16} \oplus x_{0,16} \oplus c_{1,15}) x_{2,0}$ is equal to 17. So the order of $c_{2,0}$ equals 33. On the
other hand, $\mathrm{ord}(z_{1,i}) \le i$, $0 \le i \le 31$. Hence, $\mathrm{ord}(c_{2,0}) > \mathrm{ord}(z'_{1,1}) = \mathrm{ord}(z_{1,17})$.
By (15), $\mathrm{ord}(c_{2,1}) \ge \mathrm{ord}(c_{2,0} x_{2,1}) > 33$. By induction, it can be proved that
$\mathrm{ord}(c_{2,i}) \ge 33$, $0 \le i \le 31$. Thus $\mathrm{ord}(c_{2,i}) \ge \mathrm{ord}(c_{2,i-1}) + 1$. Hence $\mathrm{ord}(c_{2,31}) \ge 64$.

Let $Z_3 = (z_{3,31}, z_{3,30}, \ldots, z_{3,0})$ denote the sum $Z_2 + x_1 \oplus x_2 + c_2$. The carry
bits produced by the summation are denoted by $c_{3,31}, c_{3,30}, \ldots, c_{3,0}$. It is obvious
that $c_2 = c_{2,31}$, and $z = Z_3$. By (11) and (12), we have the following expressions
for $z_i$ and $c_{3,i}$,

$$z_i = z_{2,i} \oplus x_{1,i} \oplus x_{2,i} \oplus c_{3,i-1}, \tag{16}$$

$$c_{3,i} = z_{2,i}(x_{1,i} \oplus x_{2,i}) \oplus (z_{2,i} \oplus x_{1,i} \oplus x_{2,i}) c_{3,i-1}, \tag{17}$$

where $c_{3,-1} = c_{2,31}$. By (13), $\mathrm{ord}(z_{2,0}) = \mathrm{ord}(c_1) = 32$. Thus, $\mathrm{ord}(z_0) =
\mathrm{ord}(c_{3,-1}) \ge 64$. Rewriting (17), we have the following expression for $c_{3,i}$,

$$c_{3,i} = (z_{2,i} \oplus c_{3,i-1}) x_{1,i} \oplus z_{2,i} x_{2,i} \oplus (z_{2,i} \oplus x_{2,i}) c_{3,i-1}. \tag{18}$$

Substituting (18) into (16),

$$z_i = z_{2,i} \oplus x_{1,i} \oplus x_{2,i} \oplus (z_{2,i-1} \oplus c_{3,i-2}) x_{1,i-1} \oplus z_{2,i-1} x_{2,i-1} \oplus (z_{2,i-1} \oplus x_{2,i-1}) c_{3,i-2}. \tag{19}$$
Since $x_{1,i-1}$ only appears in $(z_{2,i-1} \oplus c_{3,i-2}) x_{1,i-1}$, it is clear that

$$\mathrm{ord}(z_i) \ge \mathrm{ord}(z_{2,i-1} \oplus c_{3,i-2}) + 1. \tag{20}$$

By (18), $z_{2,i} \oplus c_{3,i-1}$ is described by

$$z_{2,i} \oplus c_{3,i-1} = z_{2,i} \oplus (z_{2,i-1} \oplus c_{3,i-2}) x_{1,i-1} \oplus z_{2,i-1} x_{2,i-1} \oplus (z_{2,i-1} \oplus x_{2,i-1}) c_{3,i-2}. \tag{21}$$

Since $x_{1,i-1}$ appears only in the second term $(z_{2,i-1} \oplus c_{3,i-2}) x_{1,i-1}$,
$\mathrm{ord}(z_{2,i} \oplus c_{3,i-1}) \ge \mathrm{ord}(z_{2,i-1} \oplus c_{3,i-2}) + 1$. Let $d_i = \mathrm{ord}(z_{2,i} \oplus c_{3,i-1})$, $0 \le i \le 31$. Then
$d_i \ge d_{i-1} + 1$. On the other hand,

$$d_0 = \mathrm{ord}(z_{2,0} \oplus c_{3,-1})$$

$$= \mathrm{ord}(z_{2,0} \oplus c_{2,31})$$

$$\ge 64.$$

Hence, $d_i \ge 64 + i$, $1 \le i \le 31$. By (20), $\mathrm{ord}(z_i) \ge 64 + i$, which proves the
lemma.

Theorem 2. Let $z = z_0, z_1, \ldots$ denote the word sequence generated by SSC2.
For any $n \ge 0$, let $z_n = (z_{31,n}, z_{30,n}, \ldots, z_{0,n})$. Let $S''_0$ be the initial state of
the word-oriented LFSR and $y_0, y_1, \ldots, y_{16}$ be the initial state of the lagged-Fibonacci
generator. Assume that $S''_0$ is not zero and $y_0, y_1, \ldots, y_{16}$ are not all
even. Then, with a probability greater than $e^{-1/127}$, the binary sequences
$z_i = z_{i,0} z_{i,1} \ldots$, $0 \le i \le 31$, have linear complexity

$$\Lambda(z_i) \ge \sum_{j=1}^{64+i} \binom{127}{j} - 2 \ge 2^{126}.$$

Proof. Similar to the decomposition of $z = z_0, z_1, \ldots$, we can decompose the
word sequence $z' = z'_0, z'_1, \ldots$ generated by the filter generator into 32 component
sequences, $z'_i = z'_{i,0}, z'_{i,1}, \ldots$, $0 \le i \le 31$. Similarly, let $z''_i = z''_{i,0}, z''_{i,1}, \ldots$,
$0 \le i \le 31$, denote the component sequences of $z'' = z''_0, z''_1, \ldots$ generated by the
lagged-Fibonacci generator. Then $z_i = z'_i \oplus z''_i$. Let $T_{z'_i}$ and $T_{z''_i}$ denote the least
period of $z'_i$ and $z''_i$. By Proposition 1, the word sequence $z' = z'_0, z'_1, \ldots$
has the least period $2^{127} - 1$. Hence, the component sequence $z'_i$ has a period
of $2^{127} - 1$. By Proposition 2, it is easy to verify that the sequence $z''_i$ has a
period of $17(2^{17} - 1)2^{31}$. Therefore, $\gcd(T_{z'_i}, T_{z''_i}) = 1$. By (7), we have

$$\Lambda(z_i) \ge \Lambda(z'_i) - 2.$$

By Lemma 3, with a probability no less than $e^{-1/127}$, the linear complexity of $z'_i$
is at least $L_{64+i}$, which proves the theorem.

Theorem 2 implies that the linear complexity of the component sequences
$z_i$, $0 \le i \le 31$, is exponential in the length of the LFSR, and that these sequences
are therefore resilient to the Berlekamp-Massey attack.
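
For context, the Berlekamp-Massey algorithm computes the linear complexity of a finite bit sequence, and an attacker needs roughly twice that many keystream bits to reconstruct an equivalent LFSR. The following is a generic textbook sketch over GF(2), not code from the SSC2 authors; the name `linear_complexity` and the length cap `BM_MAX` are assumptions for this example.

```c
#include <string.h>

#define BM_MAX 256   /* longest sequence this sketch accepts */

/* Berlekamp-Massey over GF(2): returns the linear complexity of the
 * bit sequence s[0..n-1] (each entry 0 or 1), or -1 if n > BM_MAX. */
int linear_complexity(const unsigned char *s, int n)
{
    unsigned char c[BM_MAX + 1] = {0}, b[BM_MAX + 1] = {0}, t[BM_MAX + 1];
    int L = 0, m = -1;

    if (n > BM_MAX)
        return -1;
    c[0] = b[0] = 1;                        /* connection polynomials */
    for (int N = 0; N < n; N++) {
        int d = s[N];                       /* discrepancy at step N  */
        for (int i = 1; i <= L; i++)
            d ^= c[i] & s[N - i];
        if (d) {
            memcpy(t, c, sizeof c);
            for (int i = 0; i + N - m <= n; i++)
                c[i + N - m] ^= b[i];       /* c(x) += x^(N-m) b(x)   */
            if (2 * L <= N) {
                L = N + 1 - L;
                m = N;
                memcpy(b, t, sizeof b);
            }
        }
    }
    return L;
}
```

A sequence with linear complexity near $2^{126}$ is far out of reach for this procedure, which is the point of Theorem 2.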
We next study how closely an SSC2 sequence resembles
a truly random sequence. Mathematically, a truly random sequence can be modeled
as a sequence of independent and uniformly distributed random variables.
To measure the randomness of the keystream sequences generated by SSC2, let's
consider the distribution of every 32-bit word in a period.



Proposition 3. Let $X_3$, $X_2$, $X_1$, and $X_0$ be independent and uniformly distributed random
variables over $GF(2)^{32}$ and $Z = F(X_3, X_2, X_1, X_0)$ be the output of the filter
generator in SSC2. Then $Z$ is uniformly distributed over $GF(2)^{32}$.

Proof. Let $Z' = (X_3 + (X_0 \vee 1))_{\gg 16} + X_2 \oplus c_1(X_0 \vee 1) + c_2 \bmod 2^{32}$. Then
$Z = X_1 \oplus X_2 + Z' \bmod 2^{32}$. By the chain rule [2], we can express the joint
entropy $H(Z, Z', X_1 \oplus X_2)$ as

$$H(Z, Z', X_1 \oplus X_2) = H(Z') + H(Z \mid Z') + H(X_1 \oplus X_2 \mid Z, Z')$$

$$= H(Z') + H(X_1 \oplus X_2 \mid Z') + H(Z \mid X_1 \oplus X_2, Z').$$

Since $X_1 \oplus X_2$ is uniquely determined by $Z$ and $Z'$, $H(X_1 \oplus X_2 \mid Z, Z') = 0$.
Similarly, $H(Z \mid X_1 \oplus X_2, Z') = 0$. Hence, $H(Z \mid Z') = H(X_1 \oplus X_2 \mid Z')$.

Since $X_1$ does not appear in the expression represented by $Z'$, $X_1$ and $Z'$ are
statistically independent. For any $a, b \in GF(2)^{32}$,



$$p(X_1 \oplus X_2 = a \mid Z' = b) = \frac{p(X_1 \oplus X_2 = a, Z' = b)}{p(Z' = b)}$$

$$= \frac{\sum_{c \in GF(2)^{32}} p(X_1 \oplus X_2 = a \mid Z' = b, X_2 = c)\, p(Z' = b, X_2 = c)}{p(Z' = b)}$$

$$= \frac{\sum_{c \in GF(2)^{32}} p(X_1 = a \oplus c \mid Z' = b, X_2 = c)\, p(Z' = b, X_2 = c)}{p(Z' = b)}.$$

Since $X_1$ and $(Z', X_2)$ are independent, $p(X_1 = c \oplus a \mid Z' = b, X_2 = c) = 2^{-32}$. Thus,

$$p(X_1 \oplus X_2 = a \mid Z' = b) = \frac{2^{-32} \sum_{c \in GF(2)^{32}} p(Z' = b, X_2 = c)}{p(Z' = b)} = 2^{-32}.$$

Hence, $H(Z \mid Z') = H(X_1 \oplus X_2 \mid Z') = 32$. Therefore, $H(Z) \ge H(Z \mid Z') = 32$,
which implies that $H(Z) = 32$, or equivalently, $Z$ is uniformly distributed over
$GF(2)^{32}$.



Recall that the state sequence $S''_0, S''_1, \ldots$ of the word-oriented LFSR has
the least period $2^{127} - 1$ and all states are distinct in the period if the initial
state $S''_0$ is non-zero. Hence every non-zero state appears exactly once in the
least period. For this reason, we model the state $S''_n$ as a uniformly distributed
random variable over $GF(2)^{127}$. Proposition 3 indicates that we can model the
filter generator sequence as a sequence of uniformly distributed random variables
when the initial state of the filter generator is non-zero.
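
The uniformity claim of Proposition 3 can be checked exhaustively on a scaled-down analogue of the filter. The sketch below is an illustration under stated assumptions: it uses 4-bit words and a rotation by 2 in place of 32-bit words and a rotation by 16, and the names `toy_filter` and `toy_filter_is_uniform` are hypothetical. It is not the SSC2 filter itself.

```c
#define W    4u                      /* toy word size in bits (SSC2 uses 32) */
#define MASK ((1u << W) - 1u)

/* Toy analogue of the filter F on W-bit words; a rotation by W/2
 * replaces the rotation by 16. */
unsigned toy_filter(unsigned x3, unsigned x2, unsigned x1, unsigned x0)
{
    unsigned a  = (x0 | 1u) & MASK;
    unsigned s1 = x3 + a;
    unsigned c1 = s1 >> W;                        /* carry of first addition  */
    s1 &= MASK;
    unsigned t  = ((s1 << (W / 2)) | (s1 >> (W / 2))) & MASK;
    unsigned s2 = t + (c1 ? (x2 ^ a) : x2);
    unsigned c2 = s2 >> W;                        /* carry of second addition */
    s2 &= MASK;
    return (s2 + (x1 ^ x2) + c2) & MASK;
}

/* Enumerate all 2^(4W) inputs and count each output value.
 * Returns 1 if every value occurs equally often, 0 otherwise. */
int toy_filter_is_uniform(void)
{
    unsigned long count[1u << W] = {0};
    for (unsigned x3 = 0; x3 <= MASK; x3++)
        for (unsigned x2 = 0; x2 <= MASK; x2++)
            for (unsigned x1 = 0; x1 <= MASK; x1++)
                for (unsigned x0 = 0; x0 <= MASK; x0++)
                    count[toy_filter(x3, x2, x1, x0)]++;
    for (unsigned v = 1; v <= MASK; v++)
        if (count[v] != count[0])
            return 0;
    return 1;
}
```

The toy version is uniform for the same reason as in the proof: for fixed values of the other inputs, the word playing the role of $X_1$ enters only through the final addition, which is a bijection.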
Theorem 3. Let $S''_0$ and $Y_0 = (y_0, y_1, \ldots, y_{16})$ be the initial states of the LFSR
and the lagged-Fibonacci generator respectively. Assume that $S''_0$ and $Y_0$ are
random variables, and $S''_0 \ne 0$. Let $Z_n$ denote the word generated by SSC2 at
time $n$. Then

$$32 - I(S''_0; Y_0) \le H(Z_n) \le 32,$$

where $I(S''_0; Y_0)$ denotes the mutual information between $S''_0$ and $Y_0$, given by

$$I(S''_0; Y_0) = H(S''_0) - H(S''_0 \mid Y_0).$$

Proof. Since $Z_n$ is a random variable over $GF(2)^{32}$, it is obvious that $H(Z_n) \le 32$.
Let $Z'_n$ and $Z''_n$ denote the respective output of the filter generator and the
lagged-Fibonacci generator at time $n$. Then $Z_n = Z'_n \oplus Z''_n$. Moreover,
$H(Z_n \mid Z''_n) = H(Z'_n \mid Z''_n)$. According to the data processing inequality [2],

$$I(Z'_n; Z''_n) \le I(Z'_n; Y_0) \le I(S''_0; Y_0).$$

Since $S''_0 \ne 0$, $H(Z'_n) \approx 32$. Consequently,

$$H(Z_n) \ge H(Z_n \mid Z''_n)$$

$$= H(Z'_n \mid Z''_n)$$

$$= H(Z'_n) - I(Z'_n; Z''_n)$$

$$\ge 32 - I(S''_0; Y_0).$$



By Theorem 3, we can conclude that the keystream sequence of SSC2 is a
sequence of uniformly distributed random variables if the initial state of the
word-oriented LFSR is non-zero and statistically independent of the initial state
of the lagged-Fibonacci generator. However, the problem of determining whether
the keystream sequence of SSC2 is a sequence of independent random variables
remains open.



4 Correlation Analysis of SSC2 

SSC2 is a very complex mathematical system in which several different types 
of operations, such as exclusive-or, integer addition, shift, and multiplexing, are 
applied to data iteratively. If we analyze the keystream generator in Figure 1 
as a whole, it would be difficult to get information about the internal states 
of the word-oriented LFSR and the lagged-Fibonacci generator. However, if the 
keystream sequence leaks information about the filter generator sequence or 
the lagged-Fibonacci sequence, this information might be exploited to attack 
the filter generator or the lagged-Fibonacci generator separately. This kind of
attack is called a divide-and-conquer correlation attack, which has been successfully
applied to over a dozen keystream generators [3,8,10,11,14,15]. For the moment,
let's assume that the key of SSC2 consists of the initial states $S''_0$ and $Y_0$ of the
word-oriented LFSR and the lagged-Fibonacci generator respectively.
Theorem 4. Assume that the initial states $S''_0$ and $Y_0$ of the filter generator
and the lagged-Fibonacci generator of SSC2 are random variables, $S''_0 \ne 0$. Then
the outputs $Z'_n$ and $Z''_n$ of the filter generator and the lagged-Fibonacci generator
at time $n$ are also random variables. Let $Z_n$ denote the output of SSC2 at time
$n$. Then

$$I(Z_n; Y_0) \le 32 - H(Z'_n) + I(S''_0; Y_0),$$

$$I(Z_n; S''_0) \le 32 - H(Z''_n) + I(S''_0; Y_0).$$

Proof. By the chain rule [2], the joint entropy $H(Z'_n, Y_0, Z_n)$ can be represented
as follows:

$$H(Z'_n, Y_0, Z_n) = H(Y_0) + H(Z_n \mid Y_0) + H(Z'_n \mid Z_n, Y_0)$$

$$= H(Y_0) + H(Z'_n \mid Y_0) + H(Z_n \mid Z'_n, Y_0).$$

Since $Z'_n$ can be uniquely determined by $Z_n$ and $Y_0$, $H(Z'_n \mid Z_n, Y_0) = 0$. Similarly,
$H(Z_n \mid Z'_n, Y_0) = 0$. Thus, $H(Z_n \mid Y_0) = H(Z'_n \mid Y_0)$. Therefore,

$$I(Z_n; Y_0) = H(Z_n) - H(Z_n \mid Y_0)$$

$$= H(Z_n) - H(Z'_n \mid Y_0)$$

$$= H(Z_n) - H(Z'_n) + I(Z'_n; Y_0).$$

According to the data processing inequality [2], $I(Z'_n; Y_0) \le I(S''_0; Y_0)$. Hence, it
follows that

$$I(Z_n; Y_0) \le H(Z_n) - H(Z'_n) + I(S''_0; Y_0).$$

On the other hand, $H(Z_n) \le 32$, so

$$I(Z_n; Y_0) \le 32 - H(Z'_n) + I(S''_0; Y_0).$$

Similarly, it can be proved that

$$I(Z_n; S''_0) \le 32 - H(Z''_n) + I(S''_0; Y_0).$$

According to the empirical test in [7], we assume that the sequence $y = y_0, y_1, \ldots$
of the lagged-Fibonacci generator is a sequence of pairwise independent
and uniformly distributed random variables. By (6), $Z''_n = (y_n)_{\gg 16} + y'_n \bmod 2^{32}$,
where $y'_n$ is uniformly selected from $y_{n+1}, y_{n+2}, \ldots, y_{n+17}$. If $y'_n \ne y_{n+17}$, it is
obvious that $Z''_n$ is uniformly distributed. If $y'_n = y_{n+17}$, then
$Z''_n = (y_n)_{\gg 16} + y_n + y_{n+12} \bmod 2^{32}$, which is also uniformly distributed since $y_{n+12}$ and $y_n$ are
independent. Thus, for any $n \ge 0$, $Z_n$ is not correlated to either $S''_0$ or $Y_0$ if $S''_0$
and $Y_0$ are statistically independent. So we cannot get any information about $S''_0$
or $Y_0$ from each $Z_n$. However, this does not mean that we cannot get information
about $S''_0$ and $Y_0$ from a segment $Z_0, Z_1, \ldots, Z_m$ of the keystream sequence. The
question is how to get information about $S''_0$ and $Y_0$ from a segment of the
keystream sequence, which remains open.







5 Scalability of SSC2 

The security level of SSC2 can be enhanced by increasing the length of the
lagged-Fibonacci generator. Let $y = y_0, y_1, \ldots$ be the word sequence generated
by a lagged-Fibonacci generator of length $L$. As described in Section 2.3, we can
implement the lagged-Fibonacci generator with a buffer $B$ of length $L$ and two
pointers $r$ and $s$. Let $h = \lfloor \log_2 L \rfloor$. The output word $z''_n$ of the lagged-Fibonacci
generator can be described by

$$z''_n = y_n + B[1 + ((y_{n+L} \gg (32 - h)) + s_{n+1} \bmod 2^h)] \bmod 2^{32}, \tag{22}$$

where $s_{n+1}$ is the value of the pointer $s$ at time $n + 1$. We define the number
$Lh = L \lfloor \log_2 L \rfloor$ as the effective key length of the keystream generator as described
by Figure 1, where the lagged-Fibonacci generator has length $L$. The effective
key length gives us a rough estimation of the strength of the keystream generator.
We believe that the actual strength might be much larger than that described by
the effective length. Corresponding to private keys of 128 bits, lagged-Fibonacci
generators with length between 17 and 33 are recommended.
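
As a quick numerical illustration of these definitions, the sketch below computes $h$, the effective key length, and the buffer index used in (22). The function names are hypothetical, and the logarithm is taken to base 2, which is what makes the index range $2^h$ match the 4-bit mask in the length-17 generator.

```c
#include <stdint.h>

/* floor(log2(L)) for L >= 1, computed with integer shifts. */
unsigned floor_log2(unsigned L)
{
    unsigned h = 0;
    while (L >>= 1)
        h++;
    return h;
}

/* Effective key length L * floor(log2 L) as defined in the text. */
unsigned effective_key_length(unsigned L)
{
    return L * floor_log2(L);
}

/* Buffer index selected in (22): 1 + ((y >> (32 - h)) + s mod 2^h),
 * where y is the newest word y_{n+L} and s the pointer value s_{n+1}.
 * Requires 1 <= h <= 31. */
unsigned lfg_select_index(uint32_t y, unsigned s, unsigned h)
{
    return 1u + (((y >> (32u - h)) + s) & ((1u << h) - 1u));
}
```

For the two ends of the recommended range, `effective_key_length(17)` is 68 and `effective_key_length(33)` is 165.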

6 Key Scheduling Scheme 

SSC2 supports private keys of various sizes, from 4 bytes to 16 bytes. To stretch 
a private key less than or equal to 4 words to 21 words, a key scheduling scheme 
is required. By Theorem 3 and Theorem 4, the initial states of the word-oriented 
LFSR and the lagged-Fibonacci generator should be independent. With a hash
function such as SHA-1 [9], it is not difficult to generate these 21 words. For cases
where a good hash function is not available, we designed the following scheme, which
generates 21 words (called the master key) from the private key K.

Master-Key-Generation K_master(K)

 1  load K into the LFSR S, repeat K when necessary
 2  for i ← 0 to 127 do
 3      run the linear feedback shift register once
 4      S[1] ← S[1] + F(S) mod 2^32
 5      i ← i + 1
 6  for i ← 1 to 17 do
 7      run the linear feedback shift register once
 8      B[i] ← S[4]
 9      i ← i + 1
10  A ← S[1]
11  for i ← 1 to 34 do
12      run the linear feedback shift register once
13      run the lagged Fibonacci generator once
14      index ← 1 + (A >> 28)
15      A ← B[index]
16      B[index] ← A ⊕ S[1]
17      S[1] ← A + S[1] mod 2^32
18      i ← i + 1
19  B[17] ← B[17] ∨ 1
20  return S and B

In the above pseudo-code, the LFSR is denoted by S, which is an array of
4 words. Every time the LFSR runs, the word in S[4] moves out, the
array shifts right one word, and the newly computed word moves into S[1]. The
lagged-Fibonacci generator is denoted by B as usual. The master key generation
proceeds in 5 stages. In stage 1, the private key is loaded into the LFSR. In stage
2 (line 2 - line 5), the private key is processed. We run the LFSR (actually the
filter generator) 128 times in order that approximately half of the bits in S will
be 1 even if there is only one 1 in K. In stage 3 (line 6 - line 9), 17 words are
generated for the lagged-Fibonacci generator. In stage 4 (line 10 - line 18), the
LFSR and the lagged-Fibonacci generator interact with each other 34 times.
A major goal of the interaction is to make it difficult to gain information about
the state of the LFSR from the state of the lagged-Fibonacci generator and vice
versa. For this purpose, an index register A is introduced, which has S[1] as
its initial value. At the end of each run of the LFSR and the lagged-Fibonacci
generator, a pointer index is computed according to the most significant 4
bits of A, and then A is updated by the word B[index]. Following the update of A,
B[index] is updated by A ⊕ S[1] and S[1] is updated by A + S[1] mod 2^32. Through
the register A, the states of the LFSR and the lagged-Fibonacci generator are
not only related to each other but are also related to their previous states. For
example, assume that the state of the LFSR is known at the end of stage 4. To
obtain the previous state of the LFSR, we have to know the content of A, which
is derived from the previous state of the lagged-Fibonacci generator. In stage 5,
the least significant bit of B[17] is set to 1 in order to ensure that not all of the
17 words of B are even. At the end of the computation, the states of the LFSR
and the lagged-Fibonacci generator are output as the master key.

In addition to the master key generation, SSC2 supplies an optional service 
of generating a key for every frame. The key for a frame is used to re-load the 
LFSR and the lagged-Fibonacci generator when the frame is encrypted. The 
purpose of frame-key generation is to cope with the synchronization problem. 
In wireless communications, there is a high probability that packets may be lost 
due to noise, or synchronization between the mobile station and the base station 
may be lost due to signal reflection, or a call might be handed off to a different 
base station as the mobile station roams. When frames are encrypted with their 
individual keys, the loss of a frame will not affect the decryption of subsequent 
frames. 

Assume that each frame is labeled by a 32-bit frame number that is not
encrypted. Let $K_n$ denote the frame key of the n-th frame. The frame key generation
should satisfy two fundamental requirements: (1) it is fast; and (2) it is
difficult to gain information about $K_i$ from $K_j$ when $i \ne j$. Taking the two
requirements into consideration, we design a scheme that generates $K_n$ from the
master key $K_{master}$ and the frame number n. To generate different keys for different
frames, we divide the 32-bit frame number into 8 consecutive blocks and
involve each block in the frame key generation. Let $n_0, n_1, \ldots, n_7$ denote
the 8 blocks of n, where $n_0$ is the least significant 4 bits of n and $n_7$ is the most
significant 4 bits of n. The frame key generation is illustrated by the following
pseudo-code:

Frame-Key-Generation K_n(K_master)

 1  load K_master into S and B
 2  for j ← 0 to 3 do
 3      for i ← 0 to 7 do
 4          S[1] ← S[1] + B[1 + (i + n_i mod 16)] mod 2^32
 5          S[2] ← S[2] + B[1 + (8 + i + n_i mod 16)] mod 2^32
 6          run the linear feedback shift register once
 7          B[17 - (i + 8j mod 16)] ← S[1] ⊕ B[17 - (i + 8j mod 16)]
 8          i ← i + 1
 9      j ← j + 1
10  B[17] ← B[17] ∨ 1
11  return S and B

The frame key generation consists of two loops. Corresponding to each $n_i$, $0 \le i \le 7$,
the inner loop (line 4 - line 8) selects two words from the buffer B to update
the contents of S[1] and S[2]. Then the LFSR is executed and the output word
is used to update one word of B. The outer loop executes the inner loop 4 times.
Assume that n and n' are two different frame numbers. After the first run of
the inner loop, some words in S and B will be different for n and n'. Subsequent
runs are used to produce more distinct words in S and B.
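
The two loops can be sketched in C as follows. This is an illustration under stated assumptions, not the authors' key-setup code (which the paper does not list): the LFSR step is taken from the Appendix 1 listing with S[1..4] in the roles of R1..R4, arrays are indexed from 1 as in the pseudo-code, and the names `lfsr_step` and `frame_key` are hypothetical.

```c
#include <stdint.h>

/* One step of the word-oriented LFSR, as in the Appendix 1 code.
 * S[1..4] play the roles of R1..R4; S[0] is unused. */
void lfsr_step(uint32_t S[5])
{
    uint32_t t = S[2] ^ (S[3] << 31) ^ (S[4] >> 1);
    S[4] = S[3];
    S[3] = S[2];
    S[2] = S[1];
    S[1] = t;
}

/* Sketch of Frame-Key-Generation. On entry S[1..4] and B[1..17] hold the
 * master key; on return they hold the frame key for frame number n. */
void frame_key(uint32_t S[5], uint32_t B[18], uint32_t n)
{
    for (unsigned j = 0; j < 4; j++) {
        for (unsigned i = 0; i < 8; i++) {
            unsigned ni = (n >> (4 * i)) & 0xFu;    /* i-th 4-bit block of n */
            S[1] += B[1 + ((i + ni) & 15u)];        /* additions wrap mod 2^32 */
            S[2] += B[1 + ((8 + i + ni) & 15u)];
            lfsr_step(S);
            B[17 - ((i + 8 * j) & 15u)] ^= S[1];
        }
    }
    B[17] |= 1u;                                    /* keep B from being all even */
}
```

A caller would copy the master key into fresh `S` and `B` arrays for every frame, since the function updates both in place.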



Table 1. Throughput of SSC2

Machine       Size (bits)  Clock rate (MHz)  Memory (Mbyte)  OS           Compiler  Throughput (Mbits/s)
Sun SPARC 2   32           40                30              Sun OS       gcc -O3   22
Sun Ultra 1   32           143               126             Sun Solaris  gcc -O3   143
PC            16           233               96              Linux        gcc -O3   118


7 Performance 

We have run SSC2 on various platforms. Table 1 illustrates the experimental
results derived from running the ANSI C code listed in Appendix 1. Key setup
times are not included in Table 1. On a 16-bit processor (233MHz cpu), the time
for the master key generation is approximately equal to the encryption time
for one CDMA frame (384 bits in 20 ms duration), and the time for the frame
key generation is about one-twentieth of the encryption time for one CDMA
frame. Suppose that the CDMA phones have a 16-bit processor and the average
conversation takes 3 minutes. Then the total time for frame key generations is
about 2 ms. Hence, the overhead introduced by key setup is nearly negligible.

8 Conclusion 

SSC2 is a fast software stream cipher portable on 8-, 16-, and 32-bit processors.
All operations in SSC2 are word-oriented; no complex operations such as
multiplication, division, and exponentiation are involved. SSC2 has a very compact
structure that can be easily remembered. SSC2 does not use any look-up tables
and does not need any pre-computations. Its software implementation requires
very little memory. SSC2 supports variable private key sizes, and it has an
efficient key scheduling scheme and an optional frame key scheduling scheme. Its
keystream sequence has large period, large linear complexity and small correlation
to the component sequences. SSC2 is one of the few software stream ciphers
whose major cryptographic properties have been established.
Appendix 1

The following is the ANSI C code for the keystream generator of SSC2. The
code for key setup is not included.

/* Note: the carry tests below rely on unsigned long int being exactly 32 bits wide. */
unsigned long int R1, R2, R3, R4, B[18], output, temp1, temp2;
int c, s=5, r=17;

/* word-oriented LFSR */
temp1 = R2 ^ (R3<<31) ^ (R4>>1);
R4 = R3;
R3 = R2;
R2 = R1;
R1 = temp1;

/* lagged-Fibonacci generator */
temp1 = B[r];
temp2 = B[s] + temp1;
B[r] = temp2;
if (--r == 0) r = 17;
if (--s == 0) s = 17;
output = ((temp1>>16) ^ (temp1<<16)) + B[(((temp2>>28)+s) & 0xf)+1];

/* nonlinear filter, combined with the lagged-Fibonacci output */
temp1 = (R4 | 0x1) + R1;
c = (temp1 < R1);
temp2 = (temp1<<16) ^ (temp1>>16);
if (c) {
    temp1 = (R2 ^ (R4 | 0x1)) + temp2;
} else {
    temp1 = R2 + temp2;
}
c = (temp1 < temp2);
output = (c + (R3 ^ R2) + temp1) ^ output;
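
On platforms where `unsigned long` is wider than 32 bits, the carry tests in the listing above no longer behave as intended. The following is a hedged, self-contained variant of the same steps using fixed-width types; the struct `ssc2_state` and the function `ssc2_next` are names introduced here for illustration, and key setup is still not included.

```c
#include <stdint.h>

typedef struct {
    uint32_t R[5];    /* R[1..4]: LFSR words R1..R4; R[0] unused          */
    uint32_t B[18];   /* B[1..17]: lagged-Fibonacci buffer; B[0] unused   */
    int r, s;         /* buffer pointers, initially 17 and 5              */
} ssc2_state;

/* Produce one 32-bit keystream word, following the Appendix 1 listing. */
uint32_t ssc2_next(ssc2_state *st)
{
    uint32_t *R = st->R, *B = st->B;
    uint32_t t1, t2, out, c;

    /* word-oriented LFSR */
    t1 = R[2] ^ (R[3] << 31) ^ (R[4] >> 1);
    R[4] = R[3]; R[3] = R[2]; R[2] = R[1]; R[1] = t1;

    /* lagged-Fibonacci generator */
    t1 = B[st->r];
    t2 = B[st->s] + t1;
    B[st->r] = t2;
    if (--st->r == 0) st->r = 17;
    if (--st->s == 0) st->s = 17;
    out = ((t1 >> 16) ^ (t1 << 16))
        + B[(((t2 >> 28) + (uint32_t)st->s) & 0xf) + 1];

    /* nonlinear filter, combined with the lagged-Fibonacci output */
    t1 = (R[4] | 1u) + R[1];
    c  = t1 < R[1];
    t2 = (t1 << 16) ^ (t1 >> 16);
    t1 = (c ? (R[2] ^ (R[4] | 1u)) : R[2]) + t2;
    c  = t1 < t2;
    return (c + (R[3] ^ R[2]) + t1) ^ out;
}
```

A caller would fill `R[1..4]` and `B[1..17]` from the master key or a frame key, set `r = 17` and `s = 5`, and then call `ssc2_next` once per keystream word.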
Abstract. We discuss the special requirements imposed on the underlying
cipher of systems which encrypt each sector of a disk partition independently,
and demonstrate a certificational weakness in some existing
block ciphers including Bellare and Rogaway's 1999 proposal, proposing
a new quantitative measure of avalanche. To address these needs, we present
Mercy, a new block cipher accepting large (4096-bit) blocks, which
uses a key-dependent state machine to build a bijective F function for
a Feistel cipher. Mercy achieves 9 cycles/byte on a Pentium compatible
processor.

Keywords: disk sector, large block, state machine, avalanche, Feistel 
cipher. 

Mercy home page: http://www.cluefactory.org.uk/paul/mercy/ 



1 Introduction 

Disk sector encryption is an attractive approach to filesystem confidentiality.
Filesystems access hard drive partitions at the granularity of the sector (or block)
where a sector is typically 4096 bits: read and write requests are expressed in
sector numbers, and data is read and modified a sector at a time. Disk sector
encryption systems present a “virtual partition” to the filesystem, mapping each
sector of the virtual partition to the corresponding sector, through an encrypting
transformation, on a physical disk partition with the same disk geometry. The
performance is typically better than file-level encryption schemes, since every
logical sector read or write results in exactly one physical sector read or write,
and confidentiality is also better: not only are file contents obscured, but also
filenames, file sizes, directory structure and modification dates. These schemes
are also flexible since they make no special assumptions about the way the
filesystem stores the file data; they work equally well with raw database partitions
as with filesystems, and can be transparently layered underneath disk caching
and disk compression schemes. Linux provides some support for such filesystems
through the “/dev/loop0” filesystem device.

The stream cipher SEAL EH is well suited to this need. SEAL provides a
strong cryptographic PRNG (CPRNG) whose output is seekable. Thus the entire
disk can be treated as a single contiguous array of bytes and XORed with the
output from the CPRNG; when making reads or writes of specific sectors the
appropriate portion of the output can be generated without the need to generate
the preceding bytes. The same effect can be achieved, somewhat less efficiently,
by keying a CPRNG such as ARCFOUR |l(l| with a (key, sector number) pair
and generating 512 bytes with which to encrypt the sector. These schemes are
highly efficient and provide good security against an attacker who seizes an
encrypted hard drive and attempts to gain information about its contents.

However, this is not strong against attackers with other channels open to 
them. They may have user privileges on the system they’re trying to attack, and 
be able to access the ciphertext stored on the hard drive at times when it’s shut 
down. Or they may try to modify sectors with known contents while carrying a 
drive from place to place. They may even be able to place hardware probes on 
the drive chain while logged in as a normal user, and sniff or modify ciphertext. 
Against these attacks, SEAL and ARCFOUR (used as described) are ineffective. 
For example, an attacker can write a large file of all zeroes and thereby find the 
fixed encryption stream associated with many sectors; once the file is deleted, 
the sectors might be re-used by other users with secure data to write, and this 
data is easily decrypted by XORing with the known stream. Or, if attackers can 
make a guess of the plaintext in a given sector, they can modify this to another 
plaintext of their choosing while they have access to the drive by XORing the 
ciphertext with the XOR difference between the two plaintexts. 
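
Both attacks rest on the fact that a sector encrypted by XOR with a fixed stream is malleable. The sketch below illustrates this with a made-up 4-byte stream; the function names and the byte values used by a caller are hypothetical and stand in for the output of any seekable stream cipher used as described.

```c
#include <stddef.h>

/* XOR a buffer with a keystream (encrypts or decrypts in place). */
void xor_stream(unsigned char *buf, const unsigned char *stream, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] ^= stream[i];
}

/* Turn the ciphertext of a guessed plaintext old_pt into the ciphertext
 * of new_pt without knowing the stream: XOR in the plaintext difference. */
void substitute_plaintext(unsigned char *ct, const unsigned char *old_pt,
                          const unsigned char *new_pt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ct[i] ^= old_pt[i] ^ new_pt[i];
}
```

Encrypting an all-zero buffer with `xor_stream` returns the stream itself, which is the first attack; `substitute_plaintext` is the second, and it works whenever the guess of the old plaintext is correct.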

File-based encryption schemes defeat these attacks by using a new random 
IV for each new plaintext and authenticating with a MAC. However, applying 
these techniques directly to sector encryption would require that the ciphertext 
for each sector be larger than the plaintext, typically by at least 64 bytes. Thus 
either the plaintext sectors would need to be slightly smaller than the natural 
hardware sector size, harming performance when mapping files into memory 
(and necessitating a thorough re-engineering of the filesystem code) or auxiliary 
information would have to be stored in other sectors, potentially adding a seek 
to each read and write. In either case the size overhead will be about 1.5 - 3.1%. 
It’s worth investigating what can be achieved without incurring these penalties. 

SFS @ uses a keyless mixing transformation on the plaintext before applying 
a block chaining stream cipher. This greatly reduces the practical usefulness 
of many such attacks, but it falls short of the highest security that pure disk 
sector encryption systems can aspire to: that the mapping between each virtual 
and physical disk sector appears to be an independent random permutation to 
an attacker who expends insufficient computation to exhaustively search the 
keyspace. In other words, the theoretical best solution under these constraints 
is a strong randomised large block cipher. 

Several proposals exist for building large block ciphers from standard cryptographic
components such as hash functions and stream ciphers flUEl; however,
these are not randomised ciphers, and as Section 2 shows, they have certificational
weaknesses. More seriously, no proposal comes close to offering the performance
needed: bit rates equal to or better than disk transfer rates. Since small
improvements in disk access efficiency can mean big improvements to every part
of the user experience, and since performance considerations are one of the main
reasons why filesystem encryption is not more widely used, it seems worthwhile
to develop a new cipher designed specifically to meet this need.

The rest of this paper is organised as follows. Section 2 describes a certificational
weakness applicable to several existing classes of large block cipher, and
proposes a quantitative measure of avalanche based on the attack. Section 3
lays out the design goals for the solution here, a new block cipher Mercy with a
native block size of 4096 bits, and Section 4 describes the cipher itself. Section
5 discusses the design of the cipher in detail with reference to the performance
and security goals of Section 3. Finally Section 6 discusses some of the lessons
learned in the design of Mercy.

2 Avalanche and Certificational Weaknesses 

m presents an attack on bidirectional MACs based on inducing collisions in 
the internal state of the MAC. This attack can be extended to show a certifica- 
tional weakness in some large block cipher proposals. Note that neither keys nor 
plaintext can be recovered using this attack; it merely serves to distinguish the 
cipher from a random permutation. 

In general form, the attack proceeds as follows. The bits in the plaintext
are divided into two categories, “fixed” and “changing”; a selection of the bits
of the ciphertext are chosen as “target” bits. A number of chosen plaintexts
are encrypted, all with the same fixed bits and with changing bits chosen at
random; the attack is a success if a collision in the target bits of the ciphertext
is generated. Let $w_k$ be the length of the key, $w_t$ the number of target bits, and
$2^{w_p}$ be the number of different plaintexts encrypted: if the following conditions
are met:

- the result is statistically significant, i.e. the probability of seeing such a
  collision under these circumstances from a genuine PRP (approximately
  $2^{2w_p - w_t - 1}$) is small; and

- $w_p < w_k - 1$, i.e. the work factor for the attack is less than that for a key
  guessing attack against the cipher

then a certificational weakness has been demonstrated. The attack works by
inducing an internal collision in the data path from the changing bits to the
fixed bits; the width of this path determines the number of plaintexts needed and
thus the smallest $w_p$ for which the attack can work provides a useful quantitative
measure of avalanche. This attack can easily be converted to use the memory-efficient
parallel collision finding techniques of EH. so memory usage does not
present a serious obstacle to the practicality of the attack if $2^{w_p}$ adaptive chosen
plaintexts can be encrypted.

This attack may be applied to [?] by choosing the first two blocks of the
plaintext as the “changing” bits, and all of the output except the second two
blocks as the “target” bits. If the blocksize of the underlying cipher is 64 bits,
then $2^{33}$ chosen plaintexts should be sufficient to induce a collision in $\sigma$,
resulting in a collision in all the target bits as desired; if it is 128 bits, $2^{65}$
will be needed.
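
As a check on these figures, the birthday estimate can be evaluated directly. The following Python sketch is illustrative only: the function names, the 4096-bit block used to derive $w_t$, and the cut-off for a "small" false-positive probability are assumptions made for the example, not part of the attack description.

```python
from fractions import Fraction

def expected_collisions(w_p: int, width: int) -> Fraction:
    """Birthday estimate 2^(2*w_p - width - 1) for 2^w_p random values of `width` bits."""
    return Fraction(2) ** (2 * w_p - width - 1)

def is_certificational_weakness(w_k: int, w_t: int, w_p: int,
                                small: Fraction = Fraction(1, 2**32)) -> bool:
    """Apply the two conditions from the text; `small` is an assumed cut-off."""
    prp_collision_estimate = expected_collisions(w_p, w_t)
    return prp_collision_estimate <= small and w_p < w_k - 1

# Internal path of 64 bits, 2^33 chosen plaintexts: about two internal collisions.
print(expected_collisions(33, 64))
# Assumed 4096-bit block with two 64-bit blocks excluded from the target bits.
print(is_certificational_weakness(w_k=128, w_t=4096 - 128, w_p=33))
```

With a 128-bit internal path the same estimate gives about two collisions at $2^{65}$ plaintexts, matching the second figure quoted above.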

This attack also extends to bidirectional chaining systems, in which the
plaintext is encrypted first forwards and then backwards using a standard block
cipher in a standard chaining mode; in this case, the first two blocks are the
changing bits, all of the ciphertext except the first two blocks are the target
bits, and the number of plaintexts required is as before; the collision is induced
in the chaining state after the first two blocks. If the chaining mode is CBC or
CFB, all of the output except the first block will be target bits, since a collision
in the internal state after the second block means that the second block of
ciphertext is identical.

Note that this attack is not applicable to any of the proposals in [?]; neither
BEAR nor LION claims to be resistant to any kind of chosen plaintext attack,
while LIONESS carries 256 or 320 bits of data between the two halves (depending
on the underlying hash function), which would require $2^{129}$ or $2^{161}$ chosen
plaintexts; this is outside the security goals of the cipher. However, it can be
applied to BEAST from [?] by inducing a collision in the SHA-1 hash of R**
with $2^{81}$ chosen plaintexts; the changing bits are the first 160 bits of R**, and
the target bits are all of the ciphertext except the first 160 bits of T**. This
attack is clearly impractical at the moment, but it violates our expectation that
the cheapest way to distinguish a block cipher from a random permutation should
be a brute force key guessing attack.

3 Mercy Design Goals 

Mercy is a new randomised block cipher accepting a 4096-bit block, designed 
specifically for the needs of disk sector encryption; it achieves significantly higher 
performance than any large block cipher built using another cipher as a primitive, 
or indeed than any block cipher that I know of, large or small.

It accepts a 128-bit randomiser; it is expected that the sector number will 
be used directly for this purpose, and therefore that most of the randomiser 
bits will usually be zero. This is also known as a “diversification parameter” 
in the terminology of [?], or “spice” in that of [?]. This last term avoids the
misleading suggestion that this parameter might be random, is convenient
for constructions such as “spice scheduling” and “spice material”, and is used
henceforth.

Mercy’s key schedule is based on a CPRNG; the sample implementation uses
[?]. Though [?] takes a variable length key, Mercy does not aspire to better
security than a cipher with a fixed 128-bit key size, so it’s convenient for the 
purposes of specifying these goals to assume that the key is always exactly 128 
bits. 

— Security: Any procedure for distinguishing Mercy encryption from a
sequence of $2^{128}$ independent random permutations (for the $2^{128}$ possible
spices) should show no more bias towards correctness than a key guessing attack
with the same work factor. However, we do not claim that ignorance of the
spice used would make any attack harder; it’s not intended that the spice
be hidden from attackers. For this reason, Mercy is not intended to be a
K-secure or hermetic cipher in the terminology of [2].

— Resistance to specific attacks: Mercy is designed to be resistant in
particular to linear and differential attacks, as well as to avoid the
certificational weakness of Section 2.

— Speed: Encryption and decryption should be much faster than disk transfer 
rates, even with fast disks and slow processors. Specifically, they should be 
faster than 20 Mbytes/sec on a relatively modest modern machine such as 
the author’s Cyrix 6x86MX/266 (which has a clock frequency of 233 MHz). 
This translates as under 11.7 cycles/byte, within the range of stream ciphers 
but well outside even the fastest traditional block cipher rates. The current C 
implementation of Mercy achieves 9 cycles/byte; it is likely that an assembly 
implementation would do rather better. 

— Memory: The cipher should refer to as little memory as possible, and cer- 
tainly less than 4kbytes. In many environments, Mercy’s keytables will be 
stored in unswappable kernel memory; more important however is to mini- 
mise the amount of Level 1 cache that will be cleared when the cipher is 
used. 1536 bytes of storage are used. 

— Simplicity: Mercy is designed to be simple to implement and to analyse. 

— Decryption: Decryption will be much more frequent than encryption and 
should be favoured where there is a choice. 
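
The cycle budget quoted under the speed goal follows from a simple division. A minimal sketch, assuming 1 Mbyte = 10^6 bytes; the helper name and constants are introduced here for illustration only.

```python
def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Cycles available per byte at a given clock rate and throughput."""
    return clock_hz / bytes_per_second

CLOCK_HZ = 233e6          # clock frequency quoted for the Cyrix 6x86MX/266
TARGET_RATE = 20e6        # 20 Mbytes/sec, taking 1 Mbyte = 10**6 bytes (assumption)
BLOCK_BYTES = 4096 // 8   # one 4096-bit Mercy block

budget = cycles_per_byte(CLOCK_HZ, TARGET_RATE)
achieved_rate = CLOCK_HZ / 9          # bytes/sec at the reported 9 cycles/byte
cycles_per_block = 9 * BLOCK_BYTES    # cycles spent on one block at that rate

print(f"budget: {budget:.2f} cycles/byte")
print(f"rate at 9 cycles/byte: {achieved_rate / 1e6:.1f} Mbytes/sec")
print(f"cycles per block: {cycles_per_block}")
```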

4 Description of Mercy 

Since most of Mercy’s operations are based around 32-bit words, we define
$\mathbb{Z}_w = \mathbb{Z}_{2^{32}}$. Vectors are indexed from zero, so a vector
$P \in \mathbb{Z}_w^{128}$ of 128 32-bit numbers will be indexed as
$(P_0, P_1, \ldots, P_{127})$. The symbol $\oplus$ represents bitwise exclusive
OR; where + appears in the figures with a square box around it, it represents
addition in the ring $\mathbb{Z}_w$. Least significant and lowest indexed bytes and
words appear leftmost and uppermost in the figures.

Note that some details that would be needed to build a specification of Mercy- 
based file encryption sufficient for interoperability, such as byte ordering within 
words, are omitted since they are irrelevant for cryptanalytic purposes. 



4.1 T Box 

The T box ($T : \mathbb{Z}_{256} \to \mathbb{Z}_w$, Figure 1) is a key-dependent mapping of
bytes onto words. $N$ represents multiplicative inverses in $GF(2^8)$ with polynomial
base $x^8 + x^4 + x^3 + x + 1$, except that $N(0) = 0$. $d_0 \ldots d_7$ are key-dependent
bijective affine mappings on $GF(2)^8$.

$$T(x) = \sum_{i=0}^{3} 2^{8i} \, d_{2i+1}[N(d_{2i}[x])]$$
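
The construction of the T box can be sketched in Python as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the eight mappings used here are placeholders of the form x XOR c (affine and bijective, but not key-derived), since the real mappings come from the key schedule.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the field polynomial
        b >>= 1
    return product

def gf_inverse(x: int) -> int:
    """N(x), computed as x^254: the inverse for x != 0, and 0 for x = 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result

def build_t_box(d: list[list[int]]) -> list[int]:
    """T(x) = sum over i of 2^(8i) * d[2i+1][N(d[2i][x])], as a 256-entry table."""
    n_table = [gf_inverse(x) for x in range(256)]
    return [
        sum(d[2 * i + 1][n_table[d[2 * i][x]]] << (8 * i) for i in range(4))
        for x in range(256)
    ]

# Placeholder affine bijections d_k[x] = x XOR c_k (assumption, NOT key-derived).
placeholder_d = [[x ^ (0x11 * k) for x in range(256)] for k in range(8)]
t_box = build_t_box(placeholder_d)
```

Because each byte of T(x) is a composition of bijections, every byte column of the resulting table is a permutation of 0...255.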







Fig. 1. T box, Operation M, Q state machine 



4.2 Operation M 



$M : \mathbb{Z}_w \to \mathbb{Z}_w$ (Figure 1) is drawn from David Wheeler’s stream cipher WAKE
[?]; it’s a simple, key-dependent mapping on 32-bit words. The most significant
byte of the input word is looked up in the T box, and the output XORed with
the other three bytes shifted up eight bits; the construction of the T box ensures
that this mapping is bijective.

$$M(x) = 2^8 x \oplus T(\lfloor x / 2^{24} \rfloor)$$



4.3 Q State Machine 



The Q state machine (Figure 1) maps a four-word initial state and one-word input
onto a four-word final state and one-word output
($Q : \mathbb{Z}_w^4 \times \mathbb{Z}_w \to \mathbb{Z}_w^4 \times \mathbb{Z}_w$)
using taps from a linear feedback shift register and a nonlinear mixing function.

$$Q(S, x) = ((S_3 \oplus y, S_0, S_1, S_2), y) \quad \text{where } y = S_2 + M(S_0 + x) \tag{1}$$
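
A minimal Python sketch of M and Q as defined above, with all arithmetic taken modulo 2^32. The T table used here is a stand-in (each byte replicated into all four byte positions), chosen only so that its low byte column is a permutation; `m_inverse` is an illustrative addition showing why that column property makes M bijective.

```python
MASK = 0xFFFFFFFF  # words live in Z_w = Z_{2^32}

# Stand-in T table (assumption, NOT key-derived): byte b maps to b in every byte.
STAND_IN_T = [b * 0x01010101 for b in range(256)]

def m(x, t_box):
    """M(x) = 2^8 x XOR T(floor(x / 2^24))."""
    return ((x << 8) & MASK) ^ t_box[x >> 24]

def m_inverse(y, t_box):
    """Recover x from M(x); relies on the low bytes of T being distinct."""
    low_byte_to_index = {t & 0xFF: i for i, t in enumerate(t_box)}
    top = low_byte_to_index[y & 0xFF]  # low byte of M(x) comes from T alone
    return (top << 24) | ((y ^ t_box[top]) >> 8)

def q(state, x, t_box):
    """Q(S, x) = ((S3 XOR y, S0, S1, S2), y) with y = S2 + M(S0 + x)."""
    s0, s1, s2, s3 = state
    y = (s2 + m((s0 + x) & MASK, t_box)) & MASK
    return (s3 ^ y, s0, s1, s2), y

new_state, output = q((1, 2, 3, 4), 5, STAND_IN_T)
```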



4.4 F Function 



The $F_n$ function ($n \ge 8$; Figure 2) accepts a 128-bit spice $G \in \mathbb{Z}_w^4$ and a
$32n$-bit input $A \in \mathbb{Z}_w^n$ and generates a $32n$-bit output $B \in \mathbb{Z}_w^n$
($F_n : \mathbb{Z}_w^n \times \mathbb{Z}_w^4 \to \mathbb{Z}_w^n$). $F_{64}$ (usually just $F$) is the
F function for the Feistel rounds. Here $S_{0 \ldots n+4}$ represents successive 128-bit
states of a state machine; $U_{0 \ldots n+3} \in \mathbb{Z}_w$ are the successive 32-bit inputs
to the state machine, and $V_{0 \ldots n+3} \in \mathbb{Z}_w$ are the outputs.

$F_n(A, G) = B$ where

$$\begin{aligned}
S_0 &= (A_{n-4}, A_{n-3}, A_{n-2}, A_{n-1}) \\
(S_{i+1}, V_i) &= Q(S_i, U_i) \quad (0 \le i < n + 4) \\
U_i &= \begin{cases} G_i & (0 \le i < 4) \\ A_{i+n-12} & (4 \le i < 8) \\ A_{i-8} & (8 \le i < n + 4) \end{cases} \\
B_i &= \begin{cases} V_{i+8} & (0 \le i < n - 4) \\ S_{n+4,\, i+4-n} & (n - 4 \le i < n) \end{cases}
\end{aligned}$$
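
The data flow of $F_n$ can be sketched in Python as below. This is a sketch under stated assumptions: the third case of $U_i$ is read as $A_{i-8}$ for $8 \le i < n + 4$ (the reading under which each output word $V_{i+8}$ is produced at the step that consumes $A_i$), Q is passed in as a parameter, and `toy_q` replaces M with the identity purely so that the example is self-contained.

```python
MASK = 0xFFFFFFFF

def toy_q(state, x):
    """Stand-in for Q with M replaced by the identity (assumption, NOT Mercy's Q)."""
    s0, s1, s2, s3 = state
    y = (s2 + s0 + x) & MASK
    return (s3 ^ y, s0, s1, s2), y

def f_n(a, g, q):
    """F_n(A, G): run n + 4 steps of the state machine q over spice and input."""
    n = len(a)
    assert n >= 8 and len(g) == 4
    state = tuple(a[n - 4:])          # S_0 is the last four input words
    b = [0] * n
    for i in range(n + 4):
        if i < 4:
            u = g[i]                  # steps 0..3 absorb the spice
        elif i < 8:
            u = a[i + n - 12]         # steps 4..7 absorb A_{n-8} .. A_{n-5}
        else:
            u = a[i - 8]              # assumed reading of the third case
        state, v = q(state, u)
        if i >= 8:
            b[i - 8] = v              # B_i = V_{i+8} for 0 <= i < n - 4
    b[n - 4:] = state                 # last four words are the final state
    return b

block = list(range(64))
out = f_n(block, [0, 0, 0, 0], toy_q)
```

Under this reading, output word i is written at the step that reads input word i, which is consistent with the in-place operation discussed later in the paper.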




Fig. 2. F function (labels: spice, input)



4.5 Round Structure 

Mercy uses a six round Feistel structure (Figure 3) with partial pre- and
post-whitening; unusually, the final swap is not omitted. The spice
$G \in \mathbb{Z}_w^4$ (usually the sector number) goes through a “spice scheduling”
procedure, analogous with key scheduling, through which the “spice material”
$G' \in \mathbb{Z}_w^{24}$ is generated based on the input spice, using the $F_{24}$ variant
of the F function; this forms six 128-bit “round spices”. Spice scheduling uses a
dummy input to the F function; for this a vector of incrementing bytes
$H \in \mathbb{Z}_w^{24}$ is used. $P \in \mathbb{Z}_w^{128}$ represents the plaintext,
$C \in \mathbb{Z}_w^{128}$ the ciphertext, and $W_0, W_1 \in \mathbb{Z}_w^{64}$ the whitening
values. We describe decryption below; since Mercy is a straightforward Feistel
cipher, encryption follows in the straightforward way.

Fig. 3. Round structure (decryption illustrated)

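$$\begin{aligned}
H_i &= \sum_{j=0}^{3} 2^{8j} (4i + j) \\
G' &= F_{24}(H, G) \\
(L_0, R_0) &= (C_{0 \ldots 63},\; C_{64 \ldots 127} \oplus W_1) \\
(L_{i+1}, R_{i+1}) &= (R_i,\; L_i \oplus F_{64}(R_i, G'_{4i \ldots 4i+3})) \quad (0 \le i < 6) \\
(P_{0 \ldots 63}, P_{64 \ldots 127}) &= (L_6 \oplus W_0,\; R_6)
\end{aligned}$$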
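
The round structure can be sketched in Python as follows. Decryption follows the equations above; `encrypt` is derived here by inverting each round and is therefore an illustration rather than a quoted definition. The F function is passed in as a parameter, and `toy_f` is a hypothetical stand-in used only to exercise the structure.

```python
MASK = 0xFFFFFFFF

def xor_words(a, b):
    return [x ^ y for x, y in zip(a, b)]

def dummy_input():
    """H: 24 words built from the incrementing bytes 0, 1, ..., 95."""
    return [sum((4 * i + j) << (8 * j) for j in range(4)) for i in range(24)]

def spice_schedule(g, f24):
    """G' = F_24(H, G): 24 words, i.e. six 128-bit round spices."""
    return f24(dummy_input(), g)

def decrypt(c, w0, w1, g_prime, f):
    left, right = c[:64], xor_words(c[64:], w1)
    for i in range(6):
        left, right = right, xor_words(left, f(right, g_prime[4 * i:4 * i + 4]))
    return xor_words(left, w0) + right

def encrypt(p, w0, w1, g_prime, f):
    left, right = xor_words(p[:64], w0), p[64:]
    for i in reversed(range(6)):
        # Invert (L_{i+1}, R_{i+1}) = (R_i, L_i XOR F(R_i, round spice i)).
        left, right = xor_words(right, f(left, g_prime[4 * i:4 * i + 4])), left
    return left + xor_words(right, w1)

def toy_f(r, g):
    """Stand-in round function (assumption, NOT Mercy's F)."""
    return [(word * 3 + g[k % 4]) & MASK for k, word in enumerate(r)]

plaintext = list(range(128))
w0, w1 = [0xA5A5A5A5] * 64, [0x5A5A5A5A] * 64
round_spices = spice_schedule([1, 2, 3, 4], toy_f)
ciphertext = encrypt(plaintext, w0, w1, round_spices, toy_f)
```

Note that the swap is performed in all six rounds, so an even number of swaps takes place and the halves end up in their original positions.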

4.6 Key Schedule 

The key is used to seed a CPRNG from which key material is drawn; [?] is
used in the sample implementation (after discarding 256 bytes of output), and
is convenient since it’s small and byte oriented, but any strong CPRNG will
serve. Then the procedure in Figure 4 generates the substitutions $d_{0 \ldots 7}$ and
the 2048-bit whitening values $W_0, W_1$.

for i ← 0 ... 7 do
    d_i[0] ← random_byte()
    for j ← 0 ... 7 do
        do
            r ← random_byte()
        while r ∈ d_i[0 ... 2^j − 1]
        for k ← 0 ... 2^j − 1 do
            d_i[k + 2^j] ← d_i[k] ⊕ r ⊕ d_i[0]
for i ← 0 ... 1 do
    for j ← 0 ... 63 do
        W_{i,j} ← 0
        for k ← 0 ... 3 do
            W_{i,j} ← W_{i,j} + 2^{8k} random_byte()

Fig. 4. Key schedule pseudocode 
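
The pseudocode of Figure 4 translates to Python roughly as follows. This is a sketch, not the sample implementation: the function names are hypothetical, and the byte source is Python's `random.Random`, which is NOT cryptographically strong and stands in for the keyed CPRNG only so that the example runs deterministically.

```python
import random
from typing import Callable

def make_affine_bijection(random_byte: Callable[[], int]) -> list[int]:
    """Draw one bijective affine mapping on GF(2)^8 as a 256-entry table."""
    d = [0] * 256
    d[0] = random_byte()                  # constant term
    for j in range(8):
        span = 1 << j
        while True:                       # redraw until r is outside d[0 .. 2^j - 1]
            r = random_byte()
            if r not in d[:span]:
                break
        for k in range(span):
            d[k + span] = d[k] ^ r ^ d[0]
    return d

def make_whitening(random_byte: Callable[[], int]) -> list[list[int]]:
    """Draw W_0 and W_1, each 64 words assembled from four bytes, low byte first."""
    return [
        [sum(random_byte() << (8 * k) for k in range(4)) for _ in range(64)]
        for _ in range(2)
    ]

rng = random.Random(2000)                 # stand-in source (assumption), not a CPRNG
random_byte = lambda: rng.randrange(256)

d_tables = [make_affine_bijection(random_byte) for _ in range(8)]
whitening = make_whitening(random_byte)
```

Rejecting any r that already appears in the table built so far guarantees that the new basis value is linearly independent of the earlier ones, so each finished table is a permutation of 0...255.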



An expected 10.6 random bytes will be drawn for each $d$. Once $d_{0 \ldots 7}$ have
been determined, a 1k table representing the T box can be generated. During
normal use 1536 bytes of key-dependent tables are used.
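
The figure of 10.6 bytes can be reproduced by summing the expected number of draws at each step: one draw for the constant term, and for basis value j a geometric wait with success probability 1 − 2^j/256. The split of the 1536 bytes into a 1024-byte T table plus two 256-byte whitening values is an inference from the sizes quoted above; the variable names are illustrative.

```python
# One draw for d[0], then an expected 256 / (256 - 2^j) draws for basis value j.
expected_draws = 1 + sum(256 / (256 - 2**j) for j in range(8))

t_box_bytes = 256 * 4               # 256 entries of one 32-bit word each
whitening_bytes = 2 * (2048 // 8)   # W_0 and W_1, 2048 bits each
table_bytes = t_box_bytes + whitening_bytes

print(f"{expected_draws:.1f}")      # 10.6
print(table_bytes)                  # 1536
```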

5 Design of Mercy 

Existing approaches to large block ciphers use a few strong rounds based on 
existing cryptographic primitives. These ciphers cannot achieve speeds better 
than half that of the fastest block ciphers [2] or a third of the fastest stream
ciphers [?]. Current block cipher speeds don’t approach the needs of the design
goals even before the extra penalties for doubling up, while those solutions based 
on stream ciphers pay a heavy penalty in key scheduling overhead that puts their 
speeds well below those needed. 

Mercy addresses this by using more rounds of a weaker function. This makes 
more efficient use of the work done in earlier rounds to introduce confusion in 
later rounds, and is closer to a traditional block cipher approach drawn across 
a larger block. It also draws more state between disparate parts of the block, 
protecting against the certificational weaknesses identified in Section 2.



5.1 Balanced Feistel Network 

Balanced Feistel networks are certainly the best studied frameworks from which 
to build a block cipher, although I know of no prior work applying them to 
such large block sizes. They allow the design of the non-linear transformations 
to disregard efficiency of reversal and provide a familiar framework by which 
to analyse mixing. Balanced networks seem better suited to large block ciphers 
than unbalanced networks, since an unbalanced network is likely to have to do
work proportional to the larger of the input and output data width. 

Feistel ciphers normally omit the final swap, so that decryption has the same 
structure as encryption. However, Mercy implementations will typically encrypt 
blocks in-place, and the cost of having an odd number of swaps (forcing a real 
swap) would be high, so the last swap is not omitted. 

Mercy’s round function, while weaker than those used in [?], is considerably
stronger than that of traditional Feistel ciphers, necessitating many fewer rounds. 
The larger block size allows more absolute work to be done in each round, while 
keeping the work per byte small. 



5.2 Key Schedule and S-Boxes $d_{0 \ldots 7}$

The function $N$ used in building the T box is that used for nonlinear substitution
in [?]; it is bijective and has known good properties against linear and differential
cryptanalysis, such as low LCmax and DCmax in the terminology of [?]. We use
this function to build known good key-dependent bijective substitutions using
an extension of the technique outlined in [3]; however, rather than a simple
XOR, the $d$ mappings before and after $N$ are drawn at random from the entire
space of bijective affine functions on $GF(2)^8$, of which there are approximately
$2^{70.2}$, by determining first the constant term $d[0]$ and then drawing candidate
basis values from the CPRNG to find a linearly independent set. Because the $d$
functions are affine, the LCmax and DCmax of the composite function
$d_1 \circ N \circ d_0$ (and siblings) will be the same as those of $N$ itself. The
composite functions will also be bijective since each of the components is, and
hence all the bytes in each column of the T table will be distinct.

However, there are fewer possible composite functions than there are pairs
$d_0, d_1$. In fact for each possible composite function, there are $255 \times 8 = 2040$
pairs $d_0, d_1$ which generate it. This follows from the following two properties of
$N$:

$$\forall a \in GF(2^8) - \{0\} : \forall x \in GF(2^8) : a N(a x) = N(x)$$

$$\forall b \in 0 \ldots 7 : \forall x \in GF(2^8) : N(x^{2^b}) = N(x)^{2^b}$$

Since both multiplication and squaring in $GF(2^8)$ are linear (and hence affine) in
$GF(2)^8$ (squaring because in a field of characteristic 2,
$(x + y)^2 = x^2 + xy + yx + y^2 = x^2 + y^2$), both of these properties provide
independent ways of mapping from any pair $d_0, d_1$ to pairs which will generate the
same composite function, distinct in every case except $(a, b) = (1, 0)$. Taking this
into account, the number of distinct composite functions possible is approximately
$2^{129.4}$, and there are $2^{4613.7}$ functionally distinct keys in total (considering
$W_0, W_1$ as well as T).
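
Both properties of $N$, and the counts derived from them, can be checked numerically. The Python sketch below is illustrative: the helper names are hypothetical, and the final count assumes that the key consists of four independent composite functions plus the 4096 bits of $W_0, W_1$, which is one reading that reproduces the quoted figure.

```python
import math

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product

def gf_pow(x: int, e: int) -> int:
    result = 1
    for _ in range(e):
        result = gf_mul(result, x)
    return result

N = [gf_pow(x, 254) for x in range(256)]  # inverse table with N(0) = 0

# Property 1: a * N(a * x) = N(x) for every nonzero a.
scaling_holds = all(
    gf_mul(a, N[gf_mul(a, x)]) == N[x] for a in range(1, 256) for x in range(256)
)
# Property 2: N(x^(2^b)) = N(x)^(2^b) for b = 0 .. 7.
frobenius_holds = all(
    N[gf_pow(x, 2**b)] == gf_pow(N[x], 2**b) for b in range(8) for x in range(256)
)

# log2 of the number of bijective affine maps: 2^8 constants times |GL(8, 2)|.
log2_affine = 8 + sum(math.log2(256 - 2**i) for i in range(8))
# Each composite function arises from 255 * 8 = 2040 pairs (d_0, d_1).
log2_composite = 2 * log2_affine - math.log2(255 * 8)
# Assumed breakdown: four composite functions plus two 2048-bit whitening values.
log2_keys = 4 * log2_composite + 2 * 2048

print(scaling_holds, frobenius_holds)
print(round(log2_affine, 1), round(log2_composite, 1), round(log2_keys, 1))
```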

Little attention has been paid to either the time or memory requirements 
of the key schedule, since key scheduling will be very infrequent and typically 
carried out in user space. 



